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COMPRESSED DATABASE WITH SELECTION OF COMPACTION METHODS 

Background of the Invention 
This invention relates to databases. In particular, this invention relates to an improved database storage 
structure. 

5 A database is a coBection of biterrelated data, typically stored on a computer, to serve multiple applications. 

Data in a database is logicaDy represented as a collection of one or more tables, composed of a series of rows and 
columns. Each row in the table, called a record, represents a coOectbn of related data. Each cohmm m the table, 
called a field, represents a particular type of data. Thus, each row is composed of a series of data values, one data 
value from each field. 

10 A database is organized from two perspecthres. A logical perspective descrbes how the database is 

organized from a user's vtewpoint. A physical perspecthre describes how data is actually recorded in computer 
storage. The prior art describes various techniques designed to alter the physical organization of a database while 
mamtaining the same logical perspective, in order to reduce computer storage requirements. One technique, data 
compression, b weB-known in the prior art Peter Abberg's paper, "Space and Tone Savings Through Large Database 

15 Compression and Dynamb Restructuring," Proceedbigs of the IEE& VoL 63, No. 8, August 1975, pp. 1114-1122, 
descrbes the use of binary number codes to represent data consisting of character strmgs. Alsberg noted that when 
specific data values occur repeatedly, it is feasble to use a variable-length compression code for that field. Usmg 
shorter codes for the frequently occurring data values and longer codes for the infrequently occurring data values 
achieves greater compression. The paper by Dennis Severance, "A Practitioner's Guide to Database Compression," 

20 hformation Systems, VoL 8, No. 1, 1983, pp. 51-82, describes a sinular binary encoding compression scheme. 
Severance outEnes a method where the data values are ordered by probabifity of occurrence and then assigned a 
variable-biMength binary code using Huffman coding, a weB-known optimum code for this purpose. 

Besides data compressbn, pattern recognition can also be used to reduce the data storage requirements 
of a database virithout altermg the logical organization of the database. In Fred McFadden's book, "Database 

25 Management," 1983, Benjamm/Cumnrings PubEshing Company, a technique called pattern substitution is described. 
This technique first identifies repeating sequences of characters that occur virithin a particular field, then replaces 
these sequences of characters by a single character which represents the pattern. 

Besides reducing data storage size, other techniques for altering the physical organizatton of a database 
to allow faster data access have been described. For example, data records may be grouped in a way which aDows 

30 data herns which are accessed more frequently to be stored on the fastest storage devices. This can be achieved 
by splitting the stored records into separate segments and aflocating separate segments to separate physical storage 
devices, some of which permit faster data access than others. As another example, records can be pbysbaDy 
grouped together if they are frequently accessed together, such as grouping records on the same disk sector or disk 
track bi this manner, fewer disk accesses are needed to transfer data to or from the main computer memory for 

35 a particular application. Thb technique is described in McFadden's book, "Database Management," cited above. 
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The prior art also describes techniques for ahering the logical organization of a database in order to create 
a more efficient physical organization. As an example, W. Kent, m his paper, "Choices in Practical Data Design," 
Proceedings: Very Lerge Data Bases, September 8-10, 1982, pp. 165-180, descrflies field design options. Kent 
describes alternat'nre representations of the same data relationshqis. Each specific data type can be represented by 
5 e separate field or grouped wKh other data types into a combined field, in Kent's article, data types are either 
combined into a nngle field or separated into distinct fields based on the known relationships between the data 
types. For exanqile, "date" can be a single field, or date information can be represented by three separate fields: 
"month," "day" and "year." 

Summary of the Invention 

10 The present invention is directed to an onproved physical structure for databases which are logically 

composed of e series of rows and colimtns and a method for creating the structure. A conventional database 
typically makes inefficient use of computer storage space and requires a significant amount of data access time. 
By redesignnig the physical structure of such a database, the same information can be recorded in less space and 
can be accessed faster, hi accordance with the present mvention, this redesign is accompGshed, m part, by the 

15 automatic selection or rqection of one or more compaction methods to be appfied to mdividual database fields, h 
a preferred embodiment, there are five compaction methods which can be appFied to the database fields: stogle-field 
encoding, multiple-field combinatfain, pattern suppresaon, numerical substitution and text compressnin. A particular 
compaction method is appFied to a partmular fieU only if die benefit from storage savings is sufficiently large to 
offset the penalty in overtiead storage end increased storage complexity. The overiiead storage b mostly due to 

20 translation tables needed to convert conqiacted data back to original form. A favorable tradeoff between storage 
savings and overiiead is achieved by setting criteria for each compaction metiiod which must be satisfied by the data 
within a particular field. That h, various data-dependent characteristics are calculated for each field, and these field 
characteristics ere then compared to specific compaction method criteria to determine which compaction methods 
are applicable to which fields. In this manner, a system is provided whereby a compaction method or methods for 

25 each field are automatically selected to yield favorable results. The method or metiiods are then applied to each 
record of the database to create equhralent compacted records, requ'irmg less storage. 

hi accordance with enother aspect of tins invention, data access performance can be improved over a 
conventional database by a method of grouping compacted records according to record lengtfi, where record length 
is the number of storage bytes required for a compacted recori Each equal record-length group is written to a 

30 common storage area, caDed a partition. This results in a "database image" containmg nniltiple partitions, each 
partition containing records of the same record length. This same-length record structure sbnpBfies the data 
processing required to read, modify, delete and add records and speeds database access. In accordance with a 
further espect of this invention, various compactkm metiiods convert data vahies into equhralent compacted data 
values wMch reqtrire less storage space. The single-field encoding method accomprishes this by assigning a code to 

35 each different data value m a field. A field whuA can be encoded is also a candidate for the multipb-field combining 
method. The multiple-field combining method confines two or more fields into a single fieU by assigning a unique 
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code to each different combination of data values which occur in a single record, within the fields to be combined. 
The code assigned to a data vahie combination is a reduced storage equivalent of the data vahie combination. The 
data value combination can be reconstructed from the code, and the code requires less storage space m the 
database. That is, the codes generated by the multiple-fteld combining method become compacted data vahies that 
5 replace each of the data vahja combinations from the fields to be combned, creating a smgle combined field. 

A field which cannot be encoded might be compacted by the pattern suppression method. For pattern 
suppression, a smgia character repeatedly occurring in the same position within a frald is identified as a pattern. 
Also, muhqile characters that repeatedly occur together in a single record in the same positions within a field are 
identified as a pattern. These characters need not be contiguous with'm the field. A pattern b removed from each 

10 record in which it occurs and a designation of the pattern removed is associated with the record. In a preferred 
embodiment, a 'type byte" at the end of each compacted record, used to specify the storege method for each field, 
B used to desqnate which patterns have been removed from each field. Tbis saves the storage space required by 
tiie pattern and there is no need to use a substituted character. 

A method of encoding using nitegral*byte length codes is used in the single-field encodbig and multiple-field 

15 combination conqiaction methods. Hence, an item to be encoded can be a data vahie (smgle-field encoding) or a data 
vahie conAination (midtqile-field combination). Encoding begins by selection of a field to be encoded. A field may 
be a combined field contauinig data vahie combinations as a resuh of the multiple-fieM combining. The different 
items to be encoded are identified, and the relative frequency of occurrence of the items within the field is 
determmed. A unique variable-byte length code is then assqned to each different itemr where the more frequently 

20 occurring items are assigned smaller codes. The codes then replace all occurrences of the items m the field to be 
encoded. 

It should be noted that run-iength encoding schemes, such as Huffman codeSr have variable-bit length codes. 
The database records may consist of many fields. If a number of these fields are encoded using a variable-bit length 
code, the overall bit length or storage length of each record will vary widely. The partitiorang of contacted records 
25 is unpractical if the number of possible reconi lengtiis is not lim'ited. Hence, a tradeoff exists between the optimum 
compactfon achieved by variable-bit length codes and the resuhnig widely variable record lengths, fai accordance with 
a preferred embodiment of tiie present invention, this tradeoff is accompBshed by using a variable-length code of 
intqrat-byte fength. 

Another aspect of this invention relates to a database structure which inchides a database image and 
30 vartous tables for uncompacting the data m the database nnage. The database image is made up of partitions. Each 
partition contains compacted records of the same tength. Within each partition are subpartitions of compacted 
records using the same storage configurations for each field hi the records. The compacted records are all of 
integral-byte length and contan the compacted data values created by the various compaction methods. This, 
database structure resides on a readable computer storage medium, and b a reduced storage version of a 
35 conventional database. 
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A further aspect of this invention relates to a complete database system which is implemented on a 
conventional computer system using the database structure descrbed above. The database fanage resides in the 
computer system's mass storage, and the tables reside in the computer system's random access memory. A 
conventional operatmg system controls conventional computer system functions such as program execution and 
S input/output An access program is required to interpret and perform user requests such as data readSr insertions, 
deletions and updates. 

Further features and advantages of the present inventten wiB become apparent from the detailed description 
of preferred embodiments which follow, when taken together with the appended figures and ctaans. 

Brief Description of the Draw'wos 
10 Figures 1A-E show an example of a database from a logical perspecthre and Figures 1F-I show the results 

of epplying an embodiment of the present invention to this database; 

Figure 2A-B are an illustration showing a physical data structure for a database u»ng the example database 
of Figures 1A-E; 

Figure' 3 is an illustration of the physical data structure of en embodiment of the database image according 
15 to the present bivention; 

Figure 4 is a bbck diagram showing an embodiment of a computer-implemented database system according 
to the present nivention; 

Figure 5 is a functional flow diagram showing an embodiment of the database image build process according 
to the present invention; 

20 Figures 6A-E are tables summarizing compaction methods, preferred field characteristics and compaction 

criteria accordmg to the present bivention; 

Figure 7 is a top-level flowchart of a database bnage build process *n accordance with a preferred 
embodiment of the inventten; 

Figure 8 is a top4evel flowchart of a preferred field characteristic test process used to determme which 
25 Mis are amenable to particular compaction methods; 

Figure 9 is a detaHed flowchart of a preferred field characteristic test pertaining to the smgle-field encoding 
compaction method; 

F«ure 10 b a detailed flowchart of a preferred field characteristic test pertaming to the multq)le*field 
combinmg compaction method; 
30 Figures 11A-E show a detailed flowchart of a preferred field characteristic test pertaining to the pattern 

suppressran compaction method; 

Figure 12 is a deta9ed ftowchart of a preferred field charactertetic test pertaining to the numeric 
substitutton compaction method; 

Figure 13 is a detailed flowchart of a preferred field characteristic test pertabring to the text compression 
35 compaction method; 
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Figure 14 is a detailed flowchart of a preferred translation tabte creation process applicable to single*fleld 
encoding and to muhiple-field combining compaction methods; 

Figures 15A-B show a detailed flowchart of a preferred text parsing process appficable to the text 
compression compaction method; 
5 Hgures 16A-C show a detailed flowchart of a preferred process for creating compacted records for the 

database image; 

Figure 17 is a detailed flowchart of a preferred pattern suppression process; 

FQure 18A is a detailed flowchart of a preferred text compression process; 

Figure 18B is a chart of a preferred text compression encoding scheme; 
10 Figure 19 is a detailed flowchart for a preferred code determ'mation process; 

F^ures 20A-B show a detailed flowchart of a preferred process for generating the integral-byte length 
codes, inchiding smgle-byte, variable-byte and combination codes; 

Figures 21A-B show a detailed flowchart of a preferred storage partitiom'ng process advantageously used 
to produce a database nnage; 
15 Figures 22A-C show a detailed flowchart of a preferred process for creating the index access structure for 

the database imqe; 

FQure 23A b a functional flow diagram and tables aiustrating an embodiment of the complete physical data 
structure acconfing to the present invention; 

Figures 23B-N show examples of the tables referenced in Figure 23A; 
20 Figure 23M is a table of a storage method identification scheme; 

Fqure 24 is a top-level flowchart of a preferred database image access process for data reads, data deletes 
and data updates; 

Figures 25A-B show a detailed flowchart of a preferred data compaction subprocess used 'm the access 
process of Fqure 23A; 

25 Fqures 26A-B show a detaSed flowchart of a preferred data search subprocess used in the access process 

of Rgure 23A; 

Figure 27 b a detailed flowchart of a preferred record processing subprocess used in the access process 
of Figuro 23A; 

Rgure 28 is a detailed flowchart of a preferred access process for data updates; and 
30 Figure 29 is a detailed flowchart of a preferred access process for data niserts* 

Descrintion of the Preferred Embodiments 
in accordance with the present invention, a database image is created from a database logically organiied 
as one or more tables of rows and columns. Shown m HGS. 1A-E are tabbs for the database "COMPANY." Each 
table contains a coBection of related data. For example, the table "EMPLOYEEJNFO/ contains aB data related to 
35 Company employees. Each row of a table contains data on one specific member of the related data collection. For 
example, row 1 of the table "EMPLOYEE JNFO" contains aB data descrying employee Arnold Benjanrin. The cohimns 
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of a particular table contain a specific type of data common to the entire data collection. For example, the column 
"SSN" contains the social security numbers for aD employees. The intersection of a particular row and cohimn 
contains a data value describuig a very specific aspect of one member of the related data collection. For example, 
cohmm "EID" of row 1 of table "EMPLOYEE JNFO" specifies "the empbyee identification number of Arnold 
5 Benjamin." The associated data value is the character string "10839." Tables like FIGS. 1 A E represent the logical 
or human perspective of a database. There are various ways to togically represent the same data relationships. 
FIGS. 1A-E show multiple, interrelated tables. For stmpBcity, the preferred embodiments of the database structure 
and method are described here in terms of a single table. These operations can be extended to mult'qile tables in 
a manner described below. 

10 An example of the type of physical data structure which mqht be used by a conventional database is 

shown m R6. 1 FIG. 2 corresponds to the logical database structure of the EID_WRK,ASSGN table of FIG. 1C. 
In this particular example, data records, which correspond to the table rows, are ordered numerically by a particular 
field, which corresponds to a table column. The fieU used for the ordering is "EID," and the records are stored 
sequentially in blocks of disk storage. The lengtii of each record, that is, the amount of storage the record requires, 

15 varies from record to record. The means of storage of data in the orighal database from whfadi the database image 
is created is unimportant For example, the data may be stored in a user specific file or in a commercial database. 
The datsbase may already be in existence, or data can be directly input into a preprocessor which compiles data 
and creates a database image when sufficient data is avaSable. The nature of the database fields is also 
unimportant. These may be numeric or alphanumeric, fixed form or free form. The database, however, must be 

20 capabte of being togically organced in a row-cohimn format. 

The physical structure of the database imager according to a preferred embodiment of the present invention 
is shown in FIG. 3. Additional data structures, which include translation, auxiGary and access tables, are described 
betow. The database nnage consists of compacted records, each of which b a reduced storage version of a 
database record. The compacted records contain fields which may be fewer in number than timse m tiie original 

25 database due to field-combinetions during the database image buOd. The data witMn a particular field of tiie 
database image may be cheracter strings, binary codes, bbiary numbers or e combination of these. The detabase 
image records are organized into computer-storage partitbns, where each partition contains records of equal length. 

FIG. 4 illustrates a preferred embodiment of the database structure, wluch mchides the database nnage, 
as configured for a typical computer system. The buiM subsystem 10 creates a database image 12 in mass storage 

30 14 and a variety of tables in random access memory (RAM) 16. These tables include translation tables 20, auxiliary 
tables 22 and access tables 24. The build subsystem consists of a database image build processor 26. The 
database imege buiU processor 26 assumes the existence of a database 30 from which to create the database image 
12. OptranaBy, the database nnage buDd processor 2B is interfaced to a build preprocessor 32 used for constructmg 
an initial database 30 if none exists. Altemativdy, a conventional database management system (DBMS) 34 can 

35 be used to initially construct a database 30. The database image build processor 26 need not remain on the system 
after the initial buOd is complete. A rebuild, however, may be necessary after many data insertions, deletions and 
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updates are done on the database image. Such operations may eventually change the frequency of occurrence of 
data values hi a fidd to the extent that the order of data vahies in the corresponding translation table is incorrect 
(i^e. some of the more frequently occurrmg data vahies are assigned greater length codes). This may result in 
degraded performance characteristics for the database image. The database image build processor 26 would be 
5 needed to perform a rebuild in these circumstances. 

RB. 4 also flhistrates the access substructure for the database. A user mterface b achieved through 
mput/output devices 36. The access subsystem 40 provides the means of accessing the database image 12, the 
translation tables 20, auxSary tables 22, and access tables 24 and transferring data to and from the teput/output 
devffies 36. The translation tables 20 consist of single-fnU encodmg translation tables, multiple-field encoding 

10 translation tables, and test field translation tables. The auxifiary tables 22 consist of the record information table, 
pattern information tables, the mostly numeric fields table, and tiie combined fields table. The access tables 24 
consist of bidex tables and page tables. An essential part of the access subsystem 40 is the database image access 
processor 42. Optionally, the database image access processor 42 is mterf aced to an access language preprocessor 
44 which provides the database nnage user with a standanfized data access language, such as SQL (Structured 

IS Query Language). As part of the access subsystem 40, tiie DBMS 34 mqht remain on the system to operate on 
the database 30 n parallel to the database image access processor 42 operating on the database image 12. hi this 
mode, die database image access processor 42 might serve as a background process for the DBMS 34, providing 
quick access for routine data queries while the DBMS 34 provUes more sophisticated data reports. Altematively, 
the database nnage 12 might provide a backup role, providing the capabifity of regeneratmg the database 30 in the 

20 event of a system faihire. 

Although the database image 12 typically resides in mass storage 14, if smaD enough, the database image 
may reside ni RAM IB. Translation tables 20, auxifiary tables 22 and access tables 24, which are critical to fast 
data access, are preferably stored in RAM 18, but if too large for RAM IB, the translation tables 20 might be spGt 
betwem RAM 18 end mass storage 14. bi that case, translation table entries corresponding to the most frequently 

25 occurring data would still reside in RAM IB. 

Also shown in RG. 4 is the CPU 46 which provMes the hardware processing function for all database 
operations. TUs processbig function is generaBy under the control of the operating system 50. The input/output 
devices 38 mchide such devices es a keyboard, printer, end cathode-ray tube (CRT). The mass storage 14 would 
mclude such devices as bard disk, ftoppy disk or compact disk • read only memory (CD-ROM). 

30 FIG. 5 shows that the database image is buBt by selectively compacting database fields. A particular 

compactkm method is selected 54 by comparing field characteristics 56 to compaction criteria 60. The field 
characteristics S8 are derived from a specific database 62 end are features of the data in a particular field that 
determine whether a particular compacthm method 54 w3 be benefmiaL The criteria 60 are conqiaction-metiiod 
dependent bnits placed on qiecrfic field characteristics 56. A particular compaction method wiB be applied 52 to 

35 a particular field only if the field characteristics 56 satisfy the criteria 60 for applying that compaction method. 
Hve compaction metiiods 54 are possibly applied 52 to each fieU in the database 62: single-field encoding, midtiple- 
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field combimng, pattern suppression, numeric substitution and text compression. More than one compaction method 
54 may be applied 52 to the same field. By selecthrely compactmg 52, a favorable tradeoff is achieved between 
data compaction and overhead storage dua to the compaction meihod itself. The build conrguration parameters B4 
are constants which are ghren a value by the user or are ghren a default vahie and which affect the criteria 60 
5 described above. 

FIG. 6 summarizes the build configuration parameters, field characteristics and compaction criteria used for 
each compaction method. The preferred values for these constant^r given in H6. 6, are starting points for creating 
a database image. After the database mnage b buit usmg these mitial criteria, there b nothing to prevent aheratmn 
of the criteria and one or more rebuilds of the database rniage to determine if overall compaction can be improved. 
10 One of ordinary skiH m the art could easiy achieve desired tradeoffs in a specific database application by using other 
buBd configuration parameters or other criterb that may work as weD or possibly even better than those ghen in 
H6. 6. It b contemplated that such other build configuratum parameters or other criteria f aD virithin the scope of 
the present invention. 

Referring back to H6. 5, compactmg 52 creates translation and auxfliary tabbs 66 used to create 

15 compacted records 70 and to interpret database image data. The database image 72 b generated by partrtbnng 
74, which groups compacted records according to the number of bytes of compacted data they contain. Each group 
b assigned a separate area of computer storage, caBed a "partitbn." 

A method of accessing the data to the database image b abo requked. Thb bivohres constructbg an index 
access structure, bidexes are additional data structures which make the data searches more of ficbnt by referencing 

20 certain fields, calted indexing fields. For exampb, in the database tabb of FIG. 1A, the fieM containing last names 
may be made an indexing field. Thb field would be sorted alphabetbaOy and pobters to records containing each 
bst name would be created. In thb manner, records contambg a specific last name could be qubkly bcated without 
searching the entire database. The database image partitioning structure easily accommodates indexbg because 
same-length records are grouped together, aHowmg simpta algorithms to be used to locate mdividual records. 

25 Referring back to FIG. 5, a preferred end)odiment uses mdex'mg 78 to create access tabbs 80 consbtmg 

of mdex and page tables for each specified mdexmg fsU. The access tables 80 are derived from the database bnage 
72 and the transbtion tables 66. The mdex tabb con&ts of a page pobter associated with each index value, 
where an index value b a unique data vabe in the indexing field. A page pointer specifies a block of entries in the 
page table. Each page tabb entry b a record pointer. A record pobter spedfbs a partitbn number and a relative 

30 bcation within the partitbn where a record b bcated. In summary, specifying an mdex vabe specifies a page 
pointer which, via blocks of record pointers b the page table^ specif bs the bcatbn of records contambg thb mdex 
vabe withm the database image. The database biage 72r transbtion and auxSary tables 66 and access tabbs 80 
form the reconfigured physbal structure of the origbal database 62. 

As shown m FIG. 7, the database image build 82 begms by reading the build confQuration parameters 84 

35 and then testing field characteristics and sebctbg compaction methods 85. Then, transbtion tabbs are created SO 
for fblds marked for encoding 84 or fblds marked for text compression 102. Next compacted records are created 
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108 and grouped into partitions 110 to form the database mage. The final build step creates en access structure 
for the database image by indexing 111, completing the build process 112. 

Referrmg to FIG. 8, the field characteristic test process begins by selectmg one of five tests 114. Each 
of the fnre tests, encoding 116, combining 120, pattern 122, numeric 124 and text 126 correspond to the five 
5 compaction methods. For each test the field characteristics for the specif n: compaction method are calculated, these 
characteristics are compared with the eppEcabb compaction criteria and tfie fields satisfyhig tfie criteria are marked 
for compaction by that method. The encoifing test 116 detennines whteh fidds have data vahies which can feasibly 
be replaced by a code. The combining test 120 determines which encoded fields are suff^tly related with other 
fields to be combmed faito a single field. The pattern test 122 determmes which fnUs contain one or more patterns 
10 which can be removed from the data values containrng patterns. The numeric test 124 detemdnes which fields 
contsin mostiy mimeric data and tiie text test 126 determines which fields contain text whhdi can be compressed. 
After it is determ'med that aD tests have been performed on all fields 130, tiie field characteristic test procedure is 
finished 131. 

Single-field encoding substitutes a numeric code for each data vahe in a partnnlar field, witii shorter-lengdi 

1 5 codes substituted for the most frequentiy occurring data values. For ease of description, the records containing each 
data value of a field can be envisioned as bemg sorted nito cubbyholes or "slotSr" where there is a slot allocated 
for each different data vahie occurring in tite field. A record is envisioned as bemg placed in a slot if the record 
contms tiie data vahie associated widi that slot. The singte-field encodfaig metiiod can then be descrhed as 
essignmg a numeric equhratent to tiiese slots, representmg the numeric equhrabnt with a unque lunary code and 

20 substituting the code for tiie data vahie in every record contained in the slot A translation table is constructed 
which equates data values witii numeric equivalents. 

Single-field encoding is performed on a field if the field's characteristic satisfies die encoding criterion. The 
relevant field characteristic for tingle feld encoding b die number of slots, or different data vahies, in the field. 
The corresponding criterion to satisfy is a maximum allowed number of slots. There b a transbtion tabb entry for 

25 each umque code assqned to each slot and the single-fbid encodfaig criterbn ensures diet storage overiiead resulting 
from translation tabb size does not cancel the compaction benefit from encodbg. 

FI6. 1C illustrates, by exampb, die concept of sbts, records contabed m sbu and numeric equhrabnts. 
In FIG. 1C, die table 'EID^WRK^ASSGN' has a field "WO' contabing woric order codes. The records for work 
orders from BC00009 dirough BC00014, BR00015 and BR00016 are mchided b tiib fqure. There are four records 

30 m die slot BC00012, diree records each ta sbts BC00011 and BR00015, and two records each m slots BC00009, 
BC00010, BC00013, BC00014, and BR(K)016. A numerb equivabnt b sequentially assped to each sbt in 
decreasbg order of the number of records b a sbt. Assuming the sampb shown m FIG. 1C of the 
EID_WRK_ASS6N tebb represents all records, the resultbg transbtion tabb would be as shown n FIG. IF. To 
compbte the smgb-fieM encodbg process, diese numeric eqmvabnts would be represented widi bmary codes. 

35 Shorter-bngth codes couU be used to represent the smaDer value numeric equivabnts. 
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In the example, when a suffictent number of records are associated whh work order$, the entries in the 
WO field can be replaced with an equivalent bmary code. The condition where many employees are ass'gned to work 
on a specific job is common in several mdustries. Thus, in practice, the number of records in these work order slots 
couU be significantly higher than demonstrated h FIG. 1C, as denoted by the dashed ine at the bottom of the table. 
5 In such cases, replacement of the work order value m the EID_WRK_ASS6N table with a bmary code is justified. 
If a database is logically represented as multqilOr mterreiated database tables, the translation table for a 
fteid which is common to several database tables can be handled in a variety of ways. The fteld may have a 
translation table for each occurrence in a database table, a smgle translation table which covers al occurrences or 
a few translation tables, some of which are multiply used by several database tables. Multipb-usage depends on 

10 a ghren translation table meeting the criteria for several database tables. 

For single-field encoding, the relevant field characteristic as shown in R6. 6, is the number of different 
data vahies, or slotSr in the fieU. Also shown in FIG. 6 is the compaction criteria for smgle-field encoding, which 
is the maximum allowed number of slots. This criteria Gnuts the number of entries in the corresponding translation 
table. In a preferred anbodbnent, this maximum equals C, x (number of records)^, where C| b a biald configuration 

15 parameter. With this criteria, the translation table size grows with larger databases, but not in direct proportion to 
the size of the database. In a preferred embodiment, Ci equals'31.623. This vahie sets the translatkin table size 
for a 100,000 record database to 10,000 entries or 10 percent of the total number of database records. For a 100 
nulon record database, the translation table size would be 316,228 entries or 0.3 percent of the database size. 
Other vahies for the build conf guration parameter Ct can be used in this formula, and other completely different 

20 formulas can be used to echieve different tradeoffs. For example, the single*fsld encoding criterion rnqht be made 
proportional to the number of bytes in system RAM, where the translation tables would normally reside. As another 
example, this criterion might be made dependent on the number of fields in the detabase. One of ordinary skill in 
the art will recognize that the criteria design for single field encodmg may be modified in a number of ways, 
dependent on the particular system resources and the particular application. 

25 As shown in FIG. 9, the encoding test 132 begbis by selecting a particular fbld 134, readnig the data 

vahies from the database records 136 and computing the number of stots 140, that is, the number of different data 
values in the field. At this point the fieU cbaracteristfc for the single-field encoifing method has been determined. 
This characteristk: is compared to the criterion 142, a maxbnum. If the number of sbts » less than this maximumr 
then this f»ld is marked for encodmg 144. Otherwise, the field is rejected for this compaction method, if a field 

30 is rejected for ^gle-field encoding, it is tested to see if the partial encoding criterion is met 146. If so, the fieM 
is marked for partial encodmg ISO. When all fields are tested in this manner 152, the encoding test process is 
complete 153. 

Partial encodhg associates codes for most of a field's data values and leaves the remuntng data values 
unencoded. Partial encoding is feasible if the largest slots, those conta'uung the most records, account for a 
35 signrf leant portion of the total number of records m the database, iience, the fieU characteristic for partial encoding, 
as shown in FIG. 6, is the total number of records contained in the first largest slots. In a preferred embodiment. 



wo 95/32477 PCT/US95;03901 

.11- 

Cj is 256. This value of Cj is chosen because a one byte code can then be associated with these largest slots. 
Also shown in R6. 6 is the partial encoding criterion, a minimum mmiber of records. bi a preferred embodiment 
this criterion is C, k (total number of records in the database). That is» the C2 largest slots fai a field must contaoi 
a minimum fraction, C> of all records in the database. In a preferred embodbnent C3 is 0.33. This vahie is chosen 
5 because one-third is a large enough fraction of the database to make partial encoding overall worthwhile. Other 
vahies for and C) can be used. For example, can ba set to 512. Then the first and second bytes of the 
single-byte code could be used for encoding. Thus, using the same one-byte code length and using only a 512 entry 
translation table, the Ekelihood that a field is partially encoded is increased. Another example which increases the 
likelihood that a field rejected for encoding would be partially encoded b to tower C) to 025. Then the Cj largest 

10 slots would only have to contain one fourth of the total database records. For a large database, this may still result 
m significant compaction. The partial encoding criterion need not be a constant. For example, the criterion may be 
made proportional to: (total number of database records)^^. bi this manner, the total number of partially encoded 
records does not have to increase in dkect proportion to the database wi. As noted above with respect to the 
singte-field encoding criterion, one of ordmary skOI in the art will recognize other vahies for and C,, other 

15 characteristics and other criteria may be chosen to achieve other system tradeoffs. 

The multiple-field combining method creates a smgle fieU from two or more fields by substitutkm of a 
numeric code for each data vahie combmation occurring m two or more f ieUs within the same recorl Shorter-length 
codes are substituted for the most frequently occurring data vahie combinations. For ease of description, the records 
containing each data value combination can be envisioned as bemg sorted bito stots, where there is a slot allocated 

20 for each different data value combuiatwn occurring in the same record^ withm the fields to be combined. The 
multiple-fieU comlnning method can then be described as assgnmg a code to each slot and substitutmg the slot code 
mto every record contained in the slot Combining fields, however, may result in so many different slots that the 
combined field cannot practically be encoded. The tradeoff invohred is sbnBar to that for single-field encoding 
discussed above • compaction benefit versus the storage penalty due to the translation table. PotentmDy, the number 

25 of different slots from two fields is the smaller of: (1) the product of the mmiber of slots for each indivnlual field; 
or (2) the total number of records m the database. Combming two or more fields mto one compacted field, however, 
is practical if these f ieMs are sufficientiy related. Relds are related, in this context, when there n a relatively Gmited 
number of different data value combinations or slots for these fields. For examp^ to a database consisting of the 
physical characteristics of individual persons, "Height" and "Weight" might be two fields, consistmg of the recorded 

3D todwidual heights and weights, respectively. The possible combinations of recorded height and weight are numerous. 
However, there is a strong relationship between height and weight It is unlikely that there are any todhriduals to 
the slot "5 feet tail, 200 pounds'* or the slot "6 feet tail, 90 pounds." Therefore, if these fieUs were combined toto 
a field "Height/Weight" the resultmg number of slots is Skely to be such that an todhriduars recorded he'qht and 
weight could be represented by a smgle code, to a preferred embodtoient of the present tovention, the field 

35 characteristics used to determine if two fields are sufficiently related to combme are based on record per slot 
statistics. Specifically, the characteristics for a field combmation are the mean, |i, and the standard deviation, a, 
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of the number of records per sbt. The corresponding criteria are: // > 612: /j a > ft and // > <r, where the 
threshold & is a function of database size. Together, these three criteria restrict tte distributkin of records per slot. 
Field pairs are only combined if the greatest concentration of records in the fewest sbts result Thb simpfifies coding 
because shorter codes are used for the largest slots, as noted below. The multiple-rield combination characteristics 
5 are tested two fmlds at a tbne, for aB possible field pavs in the database. However, overlapping fteld pairs may 
satisfy the criteria, so a figurenif-merit is used to decide wMch of these field pairs to combine. The figure-of-merit, 
r - ^ <•> a, is a measure of the concentration of records per slot and is computed for each field pair which 
satisfies the criteria. Pairs having the largest f are combined, maximiang the total T After fields are combined, 
repeated "passes" must be made to determine if these new combined fields can be combined with other fields. 

10 For multiple-field combination, a pair of fields would be combined and encoded as a single field if the 

number of records per slot for the two fields is suffkiently concentrated. Unfortunately, it b impractical to simply 
determine the number of different data vahie peirs, or slots, for every pair of fields m the database and then compute 
the number of records per slot There are many combinations of field pairs to consider in even a smaD database. 
For example, at least 7 of the 12 fmlds of the EMPLOYEE^INFO table shown in FIG. 1 demonstrate properties which 

15 often result in single-field encodiiq. There are N!/(nl(N-n)n ififferent combinations of N items taken n items at a 
time. Hence, for 7 fields taken 2 at a tune, there are 71/(215!) - 21 unique field pairs. This means 21 
characteristic tests for the first pass atone. Subsequent passes treat a field pair satbfying the criteria as a single 
fiekL AD field pairs are re-tested, where a fieU "pair" would be either two single fields; a single field and a 
comb'mation field; or two combination fields. These passes continue mi no more fieU pairs satisfy the combining 

20 criteria. The straightforward computation of records per skit would kivohre a very large amount of computer 
processmg tone. Instead, a preferred embodiment of this inventton uses statistical field characteristics derived from 
a sample of the database, rather than performing direct computattons on the entire database. 

A sample database is a subset of the records contamed to the full database. One method of sampDng is 
to only process every Mth record of the fuB database, where M is the sample totervaL This uniform sampGng of 

25 the f ufl database has a disadvantage to that a stogte event may be responstole for a number of contiguous records 
to the database. In a preferred embodiment, to order to provkle a good statistical sample of the fuB database, 
staggered sampBng is performed. For staggered sampBng, the requirement is that the average sample toterval A, 
•equals M. For example, if M is 16, then the number of records skipped might be the repeattog sequence 1, 2, 4, 
7, 14, 28, 56, 1, 2, 4, 7, 14, 28, 56, . . . The average number of records sk'qiped is (U2-»-4+7-i'14+28't-56)/7 

30 -16. The average sample biterval A, is defmed as: 

IntKnumber of records)/sampte sizel* where IntO is the largest integer less than the quantity toside the brackets 0. 
Sample size is the number of records to the sample database. In a preferred embodiment as shown in FIG. 6, the 
sample size is x (mmAer of records)^, where is a build configi^tion parameter. The sample size grows with 
larger databases, but not m direct proportion to the size of the database. In a preferred embodtoimit, C4 - 3U23. 

35 This value sets the sample size for a 100,000 record database to 10,000 sample records, corresponding to 10 



WO»5/32477 PCTAJS95/03901 

•13- 

percent of the total number of records in the database. For a 100 milfion record database, the sample size is 
316,228 sample recordSr corresponding to 0.3 percent of the total number of records in the database. 

As shown In FIG. 6, the field characteristics for the multiplB-field combinatbn method are the mean,;/, and 
standard deviation, a, of the number of records per slot for a fieU pair. The criteria, also shown in FIG. 6, are: 

5 II > 012;^ a> 0;fJ > a, where 9 is a specified minimum threshold. Also as shown h FIG. 6, the figure- 
of-merit for decidmg between fidd combmation choices is r - a/ <t> a. hi this embodbnMt,// is projected from the 
sample mean,//^ and a is projected from the sample standard deviation, a,: //-//, x ^; a - or, x (4)^. These 
projections hoU true for stot distributions satisfying the criteria above. 

In a preferred embodiment, as shown m FIG. 6, 9 - x (number of records)^ k < 1. If the relative 

10 distribution of records per slot remams constant as the database size increases, the average records per slot will 
faicrease proportionally. In this embodiment, however, the threshold, and hence the mean records per slotr does not 
increase *m direct proportion with database size. Thus, for larger databases, a wider distribution of records per slot 
is allowed. This is analogous to singMield encoding where a preferred embodiment allows the translatton tabb to 
increase somewhat with larger databases. Here, a wider allowed .distrSnition of records per slot impEes the 

15 percentage of records encoded with the shortest codes is allowed to decrease somewhat with larger databases. 
For a preferred embodiment, k - JS and • (1i1000^ For a database ranging in size from 100,000 to 
100,000,000 record;, this results in 0 rangmg from about 30 to about S,00D records per slot Other vahies of Ps 
can be chosen to aDow fewer or greater numbers of field combmations and hence greater or fewer records per stot 
The dependency of record distributkm per stot on database size can be ahered with other vahies of k. Also, other 

20 functionally different criteria might be chosen. For example, if /i > a and // > 9/2, the number of sample 
database slots is a good projection of the number of database shits. Fields also could be combined based on 
minfanization of the number of database slots rather than maxim'izatton of the fqure-of-merit T. This would result 
in control over the translation table size, as in sbigle-field encodmg, rather than control over the concentration of 
records per slot One of ordinary skill m the art will recognize that the parameters and criteria disclosed above for 

25 muit^ie-field combmation may be modified in a number of ways to achieve various tradeoffs for specific database 
appGcatfains. 

As shown bi FIG. 10, the comb'ming test 154 begins by selecting a pair of ftoUs marked for single-fieU 
encoding 156. Then the database is read by sampling 160, that is, skipping an average of A records at a time, 
where A, the sample interval is given in FIG. 6. Next, tiie sample mean, standard deviation and fqure of merit //„ 

30 a, and f for this fieU pair are computed 162 using the formulas of FIG. 6. If these characteristics meet the 
combining criteria 164, tiie pair is mariced es a candidate for combining 166. Otherwise, tiie pair b rejected for this 
translation and another field pair is selected 156. AB pairs of fieUs are tested m this manner 170. When all 
possible combinations of FieUs are tested, a list of candidate pairs b sorted in descendmg vahies of T 172. By first 
sorting aB candidate pars by f, the most favorable pair combinations can be made. To do this, aO fields are initially 

35 set as being avaBable for combining 174. The sorted Est b read 176 beginn'mg witii the candidate pab havbig the 
largest T. If both fields are avaiable for combinmg 180, the candidate pair b mariced for combining and both fields 
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are marked as hereafter unava9able 182. If ehher or both fields are unavaQable, the candidate pair is not marked 
for combining and the availability status of the fields is unchanged. If all entries bi the ist have not been processed 
184, the next candidate pair from the sorted Est Is selected for combming 176. All candidate pairs are processed 
in this fashion. If any candidate pairs were marked for combinmg 186, folds marked for combmation are combmed 
5 as a new field ISO, and a new pass is started 1S6. Further passes are mrtlated unti no new pairs are marked for 
combinmg 192. In this manner, two or more fields can eventually be condiined into a smgle field. The sorted fist 
of r v^es for each pass can be printed and would provide nisight into database field relationshq>s which mqht 
otherwise be unknown or unsuspected. 

Pattern suppression compaction is a feasible method to apply to fidds that have not been encoded and that 
ID at least have single-position patterns, that is, the same characters occurring in the same character posittons for many 
records. A field is first tested for single-position patterns in each character position of the fiell Then a fieU is 
tested for muftipb-position patterns, that is, combinations of singtef osition patterns occurring together in many 
records. 

The characteristics for pattern suppression, shown in FIG. 6, are: the "per character percentage,** which 

15 is the percentage of record occurrences for each of the Cg most frequently occurrmg smgle-pos'ition patterns; the 
"single-position pattern determination percentage,** virhich is the total percentage of record occurrences for all of the 
Ce most frequently occurring single-positHm patterns; and the ""multiple-positkin pattern detemunation percentage," 
which is the percentage of record occurrences for combinations of single-positmn patterns. The first two 
characteristics ere determined for each character position of a field for snigle-position patterns. The last 

20 characteristic is determined for each different multiple-position (multKposition) pattern which occurs in a record. The 
corresponding compaction criteria, also given in FIG. 6, are the specified minimums, C7, Cg and Cb, for these three 
percentages, respectively. Ce through C9 are build configuration parameters. 

Characters meeting the criteria for a single-position pattern are stored as a single-position pattern and are 
also considered for multqile-posttiDn pattern formation. In a preferred embodbnent, Cb - 4, C7 - 10% and Cs - 

25 50%. That' is, only the 4 most frequently occurring characters in each position are con»dered for pattern 
determination, and each of these characters must occur in at least 10% of the records. The characters satisfying 
the first criterion together must occur in 50% of the records to be stored as a pattern. As an example, in the work 
order field WD of the WORK_ORDERJNFD table, FIG. IB, in the first character position, "A" occurs in 26.7 percent, 
"B" occurs in 53J percent and "0" in 13J percent of the records for a total percentage of 93.3 percent of the 

30 records. Hence, the vahies "A", "B** and '0" in the first character position each satisfy the 10% per character 
percentage and m total satisfy the 50% pattern determination percentage. The character "R" occurs fai the second 
character position in more than 50 percent of the records, thus safisfymg botti criteria. The value in the third 
character position is always a "0," also satisfying both criteria. No vahies in any other character position meet the 
single-posftion pattern suppresskin criteria. Hence "A", "B" and "Q" are the smgle-character pattern values for 

35 cohimn 1, "R" is the single-cheracter pattern value for cohimn 2 and "0" is the single-character pattern vahte for 
column 3. 
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A combination of stngle-position pattern characters might be a multiple'position pattern. In a prefmd 
embodiment each of these character combinations must occur in a minimum percentage of the records, where 
C9 - 10%. Continuing the example from above, the percentage of occurrences determ'med for all mult^Ie- 
position patterns for the work order WO f»ld in the W0RK_ORDERJNFO table is shown in H6. 16. The patterns 
5 winch have muhiple-position pattern percentages greater than 10 percent are advantageously stored as multipb- 
position patterns. These patterns are "ARO/ "BRO," and "B^O/ where tiie refers to any character vahie. Note 
that the combinations "AR_," "BR." and 0R_" do not exist because 0 ahvays occurs in the third character portion. 

Combmations of stngle-posrtion pattern characters which are reacted as not satisfying the multiple-position 
pattern criterion, can be used as component combinations to create a new combination which does satisfy the 

10 criterion. A new combination h formed by mtersecting two or more component combinations, such that the new 
combination contains the cohimns and values that the component combinations have in common, with the new 
combination's multiple-position pattern percentage bemg the sum of the component combination's multqile-position 
pattern percentages. These new combinations are tested to determine if the muhiple-posithm pattern criterion is 
satisfied. Testing is started by considering the component combinations that result m new combinations widi the 

15 most characters. Testing continues, consUering component combinations that result hi new combinations with fewer 
numbers of charactersr untfl aB possible intersections of component combinations greater tiian a smgle character are 
considered. Note that the number of characters hi a combination is defined as die number of single-position pattern 
characters whrch compose the combinatnn. Thus, for example, 'B_0" is a two character combmatmn. 

When component combinations create a new acceptable multipleiiositbn pattern, die components cannot . 

20 be used to intersect with other components. If two new combinations exist with the same number of characters 
and they contain one or more component combinations in common, the new combination with the largest multqile- 
position pattern percentage is saved. Acceptabifity of a multiple-position pattern formed by intersecting component 
combmations is the same as for a multiple-positiDn pattern formed directly from combinations of smgle-posttion 
patterns. In a preferred embodiment, the multiple-position pattern percentage must be greater than 10 percent of 

25 eD records. 

Continuing the example from FIG. 16, the component combinations from rejected single^ osition pattern 
combinations are "^RO/ "A_0," "ORO" and "Q_0." Of aD possible intersectbns of these components, only the 
intersections of components *_R0" widi "QRO" and "Q_0" with "QRO" create new combinations with more than 
a single character. The pattern '_R0," composed of "_R0" and "QRO," has a multiple-position pattern percentage 

30 of 6 7 « 13 percent, whie die pattern "Q_0," composed of 'Q^O" and "QRO" has a multiple-position percentage 
of 5 7 - 12 percent. Thus "^RO" is saved as an additional multiple-position pattern. FI6. 1H summarizes the 
patterns wMch are accepted for pattern suppression in this example. A data value is tested for die existence of 
a pattern by comparing it first widi patterns containing die most characters. FIG. 1H is presented m this order, diat 
is by descending number of characters. The smgle-position pattern value "0" in FIG. 1H has been maintained because 

35 it was not used m any multiple-position pattern. This pattern could be discarded because its percentage b less dian 
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the per character percentage C7. For this example, FIG. II illustrates how the records shown in the 
WORK ORDER JNFO table of FIG. IB are affected by pattern suppression compaction. 

Cg through C9 are chosen as a tradeoff between the storage savings achieved by pattern suppresaon and 
the penalty of compacted record complexity. This record complexity is due to the number of subpartitions (different 
5 type byte values within a partition) and partitions created by pattern suppression. Each removed pattern must be 
noted by a different type byte, and pattern removal causes variations in record length which may generate addittonal 
partitions. Hence, these constants are chosen to repct patterns unless they are prevalent However, other vahies 
for these parameters can be used to increase or decrease the number of patterns recognized. Altematively, the 
pattern suppression criteria might be made dependent on specific system or database characteristic^ such as the 

10 number of fiehfs or number of records. One of ordinary skO in the art will recognize that the criteria and parameters 
disclosed above may be modified in a variety of ways. 

Durmg the pattern test, a field is also tested for the existence of a numeric pattern. A numeric pattern 
exists when at least two positions in the field have digits for aD records. A number is formed by combining these 
positions and this "numeric pattern' is converted to its binary equwalent. Positions do not have to be adjacent for 

15 a numeric pattern to be identified. The binary equivalent of a numeric pattern b stored with the original form of 
the rema'ming positions. For example, suppose the first, third and fifth positions in a field are aB digits, and a record 
contains the data value 8a4x1npq for the field. The number 841 would be converted to its binary equhralent and 
stored with axnpq as the field's data vahie for that record. A numeric pattern may be combined with sbigle* 
position or multi-position patterns or may exist in an uncombined form. This differs from numeric substitution in that 

20 numeric substitution appEes to every position m a field. Note that a numeric pattern is also distingmshed from a 
single-position or multi-position pattern containing a dqitr where a digit is treated like any other character. That is, 
a digit whbh frequently occurs m a position that is not aB digits may be a single-position pattern or part of a 
multiple-position pattern. Whereas an numeric pattern only requires more than one positions of afl digits. 

H6. II iUustrates numeric pattern conversion appled to the work order "WO" field of the 

25 WORK ORDER JNFO table of RG. IB. The third column of the table shows that the thM through snth positions 
of the WO field, after pattern suppression, comprise a numeric pattern. The fourth cohmin of the table shows the 
data vahies that are not part of the numeric pattern. Each numb^ associated with the numeric pattern b converted 
to its binary equhralent and stored with the remaining data values shown in the fourth column. Although the 
preferred embodknent of numeric pattern conversion requm that a position considered as part of the numeric pattern 

30 must contain aB digits ni aB records, one of ordmary skBI m the art wlB recognize that the same concept wiB woric 
if less than aB records contain digits in the posithsns considered as part of the numeric pattern. In that case, only 
those records contammg the numeric pattern would be converted. 

As shown in FIG. 11, for the pattern test 194, a field is selected 196 and checked to determme if it is 
marked for encoding 200. If so, pattern suppression is inappficable, and if aD fields have not been tested 202 

35 another field is selected 196. If the field is not to be encoded, the database records are read for this fieU 204. 
Next aB character positions are scanned for pattern characters. This is done by first selecting a particular position 
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within the field 206, end then counting the number of occurrences of eech character in that position 210. For e 
particular position, the C, most frequently occurring characters are identifbd 212. The field characteristics for that 
position are determined next by computing the per character percentage and pattern determination percentage 214. 
H is determined whether any of these characters satisfy the per character criterion 216, and if so, it is determined 
5 if the pattern determination criterion is satisfied 218. If the position satisfies both criteria, the position is marked 
as containing a pattern 220. All portions are tested to this fashion 222. If at bast two positions are all digits 
224, these positions are marked as a numeric pattern 226. After all fields are tested In this manner 202, the 
muttiple-position test is begun. 

ContinuRig with FIG. 11, the multiple-podtion test starts by selecting a field which contains more than one 

10 single-position pattern 230. If there are none, the process ends 231. Otherwise, the database records are read 
232. The number of occurrences of each different condiination of singleiiosition patterns which occur together ni 
a record is determined for the field 234. These combinations are sorted by length 236, where the length is the 
number of single-position patterns 'n the combinathm. Going down the Bst from the longest to shortest combination, 
each combmatioTi is compared with the multiple-position pattern determination criterion 240. If a combhation 

15 satisfies the criterion, then it is marked as a midtif osition pattern 242. If a combination does not satisfy the 
criterion, it is nevertheless retabied for further processing 244w AB occurring combinations are tested in this manner 
246. Combinations which did not satisfy the pattern detemrination criterkin 250 are intersected with other rejected 
combinations to see if there is a smaller common pattern which meets the pattern determination criterion. First, the 
longest rejected combination is determined 252, and a search length for ntersection patterns is initialized at one less 

20 than the longest rejected combination 254. A search is made for a pattern m common with two or more component 
combinations equal to the search length 256. If any nitersections of this length are found 260, then the occurrence 
count of each rejected component combination is added 261 and the total is compared with the multiple-position 
pattern determmation criterion 262. If this intersection combination satisfies the criterion, the component 
conAhiations are discarded and the intersection is saved as a muhi-positran pattern 264. Otherwise, the search 

25 length is reduced by one 266. Until aD component combnations have been considered 270, the intersection searches 
continue 256. AD fields with more than one single-position pattern are processed in this fashion 272. 

A field which cannot be encoded might be cmnpacted by numeric substitution. Numeric substitution is 
appGed on fields in which most of the vahies are numeric The numeric vahies of these f ieMs are encoded with the 
binary rather than the character string representation. The remabimg values in these fields, those containmg values 

30 other than digits, are stored in their orighal character representation. Each digit in a numeric field represented by 
character strings typically requires a byte of storage. In b'mary form, however, one byte can represent up to 3 digits 
(numbers 0 through 255), two bytes can represent up to 5 digits (numbers 0 through 65,536), three bytes can 
represent up to 8 digits (numbers 0 through 16,774,216), and so on. Hence, there can be considerable compaction 
by this substitution. 

35 For numeric substitution, unencoded fields with many records contam'ing only numeric character strings 

would have each of those data vahies represented by a b'uiary equnralent. The criterion for numeric substitution. 
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shown in FIG. 6, is a minimum C^o percentage of records containing afl numeric characters. This criterion detemnines 
the tradeoff between achievable compactbn versus partition complexity. Each character string would be replaced 
by a 1 to 4 byte number. The type byte at the end of each compacted record must specify the number of bytes 
substituted by ths method. This potentially quadruples the number of subpartitions. However, a 44yte number can 
5 represent the character string 4294199296, which would otherwise require ten bytes (one byte per character). This 
reduces storage by many bytes per substituted record. In a preferred embodiment Cto is 90%. That is, at least 
90% of the records for a particular field most contain numeric data values before numeric substitution is applied. 
Otiier values for this build configuration parameter are possflile, with smaller values of C,o aDowing this method to 
be appEed to more fields. Altemathrely, the criterion may be chosen to vary with specific database or system 
10 parameters. For example, the criterion for numeric substitution may be made proportional to the number of fields, 
because fewer fields would tend to generate fewer subpartitions. One of ordinary skiB m the art would recognize 
other variations in the parameter and criterion for numeric substitution for achievmg different tradeoffs. As an 
example, FIG. II shows the records of the WO field of the WORKJRDERJNFO table of RG. IB for which numeric 
substitution can be appRei 

15 As shown m RG. 12, for the numeric test 274, a specific fffild is selected 27B, then checked to determine 

if it is marked for encoding 280. If so, numeric substitution is inappGcable. If all fields have not been tested 282, 
another field is selected 276. Otherwise, the database records are read for tins field 284. If the field is mariced 
for pattern suppression 286, the detected patterns are masked out 290. Next the fieU characteristic consisting 
of the percentage of records containing all numeric data vabes (after patterns are removed) is computed 292. If 

20 this characteristic meets the numeric substitution criteria 294, shown ni RG. 6, the field is mariced for numeric 
substitution 296. When aD fidds have been tested m this manner 282, the numeric test is complete 300. 

Text compression is only appied to text f ieMs. Text fields ere unencoded fields containmg multiple words 
of text in most records, wHh a relathrely small number of different words in the field. Specifically, a field is 
determined as being a text field when at least half of its vahies contam muHqile words and die total number of 

25 words is greater than the totel number of records. Note that a "word" m this context means any sequence of 
alphanumeric charactera separated from^other a^hanumeric characters by delimiters, such as spaces, commas, 
periods, etc. A word would hchide a sequence of numeric characters, for example. The criterion for text 
compression, shown in FIG. 6, is a maxmium number of different words. The rationale for this criterion is the same 
as for ^gle-field encoiEng • a tradeoff between storage savmgs by compression and storage overtiead due to the 

30 resulfing translation table. In a preferred embod'anent the criterion is C„ » (number of delimiters)^, where 
deSmiters are defined by the user or are, by default, a space, comma, period, slash, dash, colon, semicoton, quotation 
mark or end*of-Rne character. A sequence of deftniters is treated as a single debniter. The number of de&miters 
is an est'onate of tiie total number of words in the field. Hence, this criterion is anabgous to the criterion Ct ^ 
(number of records)^ used for the smgle-fieU encoding method. That is, the translation table size is allowed to grow 

35 with the number of words, but is not allowed to grow in dvect proportion to the size of the database. In a 
preferred embodiment C,i • 63.246, which yfelds the result that 100% of a Mi of 4,000 words can be different 
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words, 6.3% can be different if there are 1,000,000 total words and 0.63% can be different if there are 
10,000,000,000 words. One of ordinary sklD in the art will recognize that other values of the bold configuration 
parameter can be used m this f ornuila, and different criterion can be used to achieve different tradeoffs. 

Referring to FIG. 13, the text compression test 302 begins by selectmg a particular field 304. Next, it 
5 is determined if the field selected is marked for any other compaction method 306. If so, text compression is 
inappEcable. If an fields have not been tested 310, another field is selected 304. K no other compaction methods 
have been appEed, the records for this field are read 312. The fmid characteristic tested is the number of different 
words in the field 314. This characteristic is compared to the text compression criterion 316, shown in FIG. 6. 
If the criterion is satisfied, thb fteld b marked for text compression 320. Otherwise, the field is not marked, and 

10 if all fields have been tested 310, the text compression test is compbte 322. 

Referrhg back to FIG. 7, after the fields are tested 86 to determine the appEcable compaction methods 
for each field, the next step is sebcted 90 based on the compactbn method marie for a fbld. Relds marked for 
singlefmultipte field encoding are encoded separatdy 94 from fields marked as text fields 102. AO other felds skip 
the encoding step. As shown in FIG. 14, the single/multfile fieU encodmg process 324 begins by first selecting a 

15 fieU 326 (one marked for singb field encoding or a set of fields marked for multiple-field combining). Then, the 
database records are read 328. Next, the number of records in each slot b determined 330. Slots are each 
different data vahie for stngb-field encodhg or each different combination of data vabes for multiple-feld 
combination. The slots are sorted in descending order by the number of records in each slot 332, with the sbt 
havmg the most records (the "largest" slot) being first and the dot having the feast records (the "smallest" stot) 

20 being last. Numbers are then assgned to thb sorted Est 334 by the position of the sbt in the sorted Est, with the 
largest slot assigned the number 0, the second largest sbt assigned the number 1, etc These assigned numbers 
are the numeric equhrabnts of the sbts. That b, an assigned number can be used to uniquely identify a slot Thb 
completes the translation tabb for thb field. The particubr integer-byte-bngth code to use for thb field (single-byte 
code, combinatbn code or variable*byte code) b determined next 336, (see bdow). If another marked fbld or fbld 

25 combination exbts 338, it b sebcted 326 and another translation tabb b created. Otherwise, the transbtion tabte 
creatbn process b compbte 340. 

Agabi referring to FIG. 7, a text f»U b encoded separately from other fblds 102. FIG. 15 shows the text 
encodhg process 342. A field marked for text compression b frst selected 344. Next, the database records are 
read 346. The number of occurrences for each different word and debniter are then determined 350. The different 

30 words and delimiters are sorted separately in descending order by frequency of occurrence 352, with the most 
frequently occurring word and derDDiter being fm and the bast frequently occurring word and deGmiter being last. 
Words and deBmiters are then assigned numbers 354, based on position in the sorted Bsts. The most frequently 
occurring word and deEmiter are assigned the number 0 and the next most frequently occurring word and defimiter 
are assigned the number 1, etc. These assigned numbers are the numerb equivabnts of the words and delEmiters. 

35 Thb compbtes the translation tabb for singb words. 
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Besides single word compres^on, the text compression method else incorporstes phrase compression. 
Hence, the translation table for a text field possibly mcludes phrases. A phrase b two or more words with 
embedded deSnuters. A phrase component is a word or phrase used to test for larger phrases. For example, assume 
tiie "EQUIP/SERVICE' field of the "WQRKJRDERJIVFO" table of FIG. IB meets the criteria to be considered a text 
5 field. The words in the field are "radio,* "console/ "antenna," etc The frequently occurring words "radio" and 
"console" are considered phrase components. These components are then used to deteranne if a combination of 
components occur fat the field. The combination of phrase components "rad» console" does occur. Assuming "radio 
console" appears frequently, it is declared a phrase. The resulting text phrase translation table is shown m RG. 23E. 
The deSmtter translation tabb is shown m FI6. 23F. 

10 During phrase constructiori, phrases are constructed two components at a time on each pass through a text 

field. Passes continue unt3 no new phrases are recognized. This procedure is analogous to that used for 
constructing muhqile-field combbiat'ions, where fields are combined two at a time per pass. The number of possible 
two component phrases are: (number of phrase components) x (number of phrase components • 1). To prevent the 
computations from becommg unwieldy, the number of phrase components considered are Gmited. In a preferred 

15 embodfanent, tiiis Gnrit is: maxmnim number of phrase components - Cq x (number of de&miters)^'', where - 
63.246. Because the number of deHmiters is an estimate of tiie number of words in the text field, tiiis fonits phrase 
construction to about 200 components in a text field of 4,000 words and about 500 components m a text field of 
10,000,000 words. This results m about 31.5 thousand phrases to be checked in a 4,000'Wonl text field and about 
224 thousand phrases to be checked in a 10,000,000*word text field. A text field is considered for phrase 

20 construction if hs characteristic satisfies the phrase construction criterion. The relevant characteristic is the number 
of words represented by the most frequently occurring words, truncated at the maxbnum number of aDowed phrase 
components. The criterion is a mmimum percentage, C13, of the total number of words m the fmU. In a preferred 
embodiment, Cn - 10%. Hence, for a 4,000 word text field, phrase construction occurs if the 200 most frequently 
occurring words account for 400 of aP words in the field. For a 10,000,000-word text field, the 500 most 

25 frequently occurrhg words must account for 1,000,000 of all words. The phrase construction passes are computed 
on a sample of the database. The same sampFing mterval and methodology is used for text compresshm as is used 
for multiple-field cond>ination. (See discussion above). 

When aD different words have been assigned a numeric equivalent, the fieU is tested to see if phrase 
constructton should occur. As shown in FIG. 15, th» phrase construction test begms by computing the number of 

30 phrase components to consider 356. Next it b detemnned if the initial phrase components satisfy the phrase 
construction criterion 360, shown in FIG. 6. If so, a sample of the database is read 362. The sample database 
is scanned for aB combinations of two words or phrases with an intervening delimiter that occur m sequence bi the 
sample, counting the number of occurrences 364. The list of different words and phrases is sorted by number of 
occurrences 366. The words and phrases list is then truncated to the computed number of phrase components to 

35 consider 370. If any new phrases were created 372, another pass is made 362. Otherwise, the process is ended 
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for this Mi, and if any more fields remain to be processed 374, another text field is selected 344. When all text 
fields are processed, the text compression process is complete 376. 

After all translations tables are constructed, enough information exists to create the compacted records 
which win eventually forni the database image. As shown in H6. 16, compact record creation 380 begins by reading 
5 a specific database record 382. A field is selected from within this record and its data vabe is read 384. A 
compacted data value is generated by determining the compaction method or methods applicable to this field. If the 
field has a translatran table 386, ft is determined H this field is combined wfth others 380. If so, the data values 
of the other felds are read 382. The numeric eqirivalent of this data value or these data vahies are read from the 
translation table 384 and the corresponding code a generated 386. The code is then inserted into the corresponding 

10 field of the compacted record 400. If the field selected has a partial translation table 402, then it is determined 
if the current vahie is m the table 404. If so, the numeric equhralent is read 384, the code is generated 386 and 
written into the compacted record 400. If the field selected has a pattern table 406, the pattern is suppressed from 
the data vahie 410. If the field is mostly numeric 414 and the data vahie read is marked as numeric 415, it b 
converted to its bmary equnralent using the minimum necessary integral-number of bytes 416. This binary number 

15 is then Inserted mto the compacted record 418. Otherwise, the data vahie is directly inserted into the compacted 
record 420. Finally, if the field is a text field 422, text comprossnm is appfied 424 and the compressed text is 
inserted mto the compacted record 426. In aD other cases, the data vahie is inserted into the compacted record 
in originel form 428. The number representing whatever storage method was used is then written to the pack 
information array 430. If all fields ki the orqmal record have not been processed 432, the next field is selected 

20 384. Otherwise, ft is determined if any records remain to be read 434. If so, the next record is read 382. 
Otherwise, the compacted record creation process b finbhed 436. 

FIG. 17 shows pattern suppression 440. The current data value b compared wfth the patterns contained 
m the pattern mformation table for thb field 442. If more than one pettem matches 444, the largest panem b 
selected 446 and masked out of the data value 450. The pattern number b then saved for the pack informatkm 

25 array 452. K there b only one match 454, that pattern b selected 456 and masked out of the data value 450. 
If there ere no matches, pattern suppression b complete 460. 

Shown in FIB. ISA, text compressum 462 begms by identifying each word, phrase and deGmfter in the text 
fieU of the record being compacted 464. The bngest phrases are attempted first, f oBowed by shorter phrases and 
then words. Once a word, phrase or doEmfter b identified, fts numeric equivalent b read from the translatmn table 

30 or defanfter def mition table 466. At thb point the numeric equhralents for aD words, dennuters and phrases in the 
data vahie b known. These numeric equhralents are then used to create a sequence of numbers which are the 
equhralent of the data value 470. Each word, deFanfter or phrase of the data value b represented by one number 
of the sequence, and these numbers are concatenated m the order of occurrence of the words, deEmfters and phrases 
in the data vahie. Thb completes the text compression process 472. 

35 A single data vahie of a text field may contam many words, derimiters and phrases. The concatenated 

sequence of numbers which represent the compressed text of the data vahie may be many bfts long. FIG. 18B 
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describes a preferred embodiment of the format for this potentially long sequence of numbers, bi this embodiment, 
each of the sequence of numbers is of variable length, with the number itself describ'mg its length and information 
content SpecificaDy, a number representing a word, defimiter or phrase is of 1, 2 or 3 bytes. The first 44)its of 
the number describes how many bits are in the remainder of the number and what the remaining bits represent 
5 For exampta, if a word or phrase has a numeric equhralent ranging from 16 through 4,111, it is represented by a 
two4»yte number. The first 4-bits of this number would be the number 3, and the remaining 12-bits of this number 
would be the value of the ntoneric equivalent If the first 4-bits are the number 15, this represents the end of the 
data value. FIG. 18B shows that the first 4'bits might be the number 0 or 1. These vahie are intended for 
appEcations ni which some text words are not encoded (similar to partial encoding). For example, in applications 

10 that cont»n many numerics within the text fields^ the numerics may not be encoded. 

VRien the most prevalent delimiter occurs between two phrase components, where a phrase component is 
ehher a word or phrase, its numeric equivalent (equal to zero) b not stored. When the most prevalent deGmiter 
occurs at the beginning or end of a record its nuner'ic equhralent must be stored. Blanks are used as fillers to create 
a fixed-length record, as explained below. 

15 As noted above, compaction methods are applied so as to create a limited number of fued-length records, 

where length is defined as the number bits of requred storage. A text field b likely to have a wide range in the 
number of phrase components per record, which impBes a wide range in record length. In order to fimit thb Inherent 
variation m record length, a smaO number of aDowed data vahie lengths are defined for the text fieM, and the data 
values in the text field of each record b forced to one of these lengths. A preferred embodiment accompfishes thb 

20 task by first identifying the record with the greatest number of characters in the text field, Le. the greatest data 
value length. If L - (greatest data vahie Iength)f4, then four data value lengths are defined: L, 2»t 3xL and 4>cL 
The shortest defined tength which wffl fuDy contain the concatenated text data b used for each text data vahie. 
The unused portbn of thb allocated length b left blank, writh a termination character indicating the end of data. 
Hence, a text field can have data vahies of four possible tengths. An entry in the pack tnformatkm tabb for each 

25 record specifbs the bngtb used. Four bngths were chosen as a tradeoff between the number of subpartitbns 
created and the blank storage wasted in each record. Other vahies could be used, and the number of lengths couM 
be made a function of database paranraters (such as total number of fblds). One of orcEnary skO m the art will 
recognize other possibh variations bi allowed text field data value bngths. 

As expbined above with respect to H6. 16, during compacted record creation, codes are generated 

30 corresponding to the numeric equhralents contained m the single-field and muhiple-fbU translation tabtes. In a 
prefened embodiment of the present mvention, code generation creates a unique mtegraHiyte length code for each 
numeric equ'nralent in a transbtion tabb. (A byte b typicafly considered to be 8-b*its). Integral-byte codes are 
used to achbve a reasonabte tradeoff between the number of possibb record-lengths (and hence partitions) and code 
bngth. One of ordinary skiD in the art wi8 recognize that otiier code bngths are f easibb. For exampb, integral* 

35 nibble bngth codes may be used. (A nibbb b typically considered to be 4-bHs, but here it could be from 2 to 7 
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bits). In a preferred embodiment the codes selected, dependrng on the field characteristics, are the single-byte code, 
the combination code and the default variable-byte code. 

The variable-byte code uses the minimum integral-number of bytes necessary to encode the different slots 
of a field. That is, the numeric «|uhralents of the 256 largest slots (those containnig the most records) are encoded 
5 with a single byte (i.e. numeric equwalents 0 through 255 are encoded as 00,b through FF,6), the next 65,536 largest 
slots are encoded using two bytes G.e. numeric equhratents 256 through 65,791 are encoded as 0000,s through 
FFFF,^, the next 16,774,216 largest slots are encoded us'mg three bytes (000000,b through FFFFFF,6), and so on. 

In accordance with a preferred endiodiment of this inventhin, if a field has a Emited number of slots, the 
mtegral-byte length code used for field encoding can consist entirely of a one byte code. One byte can only 

10 represent 256 different vahies, but this range can be extended by utilizmg a designation appended to each compacted 
record, called the type byte. The 'smgle-byte code" consists of a first smgle-byte which represents the 256 largest 
slots fi-e. numeric equhralents 0 through 255 are encoded as 00,b through ff^. A second single-byte represents 
the next 256 largest slots fte. numeric equnralents 256 through 511 are also encoded as 00,, through FFt^). A thN 
single-byte represents the next 256 largest slots (le. numeric equivalents 512 through 767 are also encoded as 00,s 

15 through FFJ^ and so on. The type byte specifies (among other thmgs) whether a first, secondr thM, etc shigle-byte 
represents the data. The type byte Is explained in detail below in the discussion on partitioning. The advantage 
of. the single-byte code over the variable-byte code is that aD of the largest slots (those containing the most 
frequently occurring data vahies) are represented by a sbigle byte, whereas only the 256 largest sbts of the variable- 
byte code are represented by e single byte. Thus, the single-byte code achieves reduced storage requirements for 

20 the database data. However, use of the single-byte code can drastically increase database storage complexity as 
compared to use of the variable-byte code. By restricting appBcation of the single-byte code to fields with a Emited 
number of slots, the storage complexity and access tone of the database is comparable to that achieved by use of 
the variabb-byte code. The discussion below on partitioning clarifies this tradeoff. 

In a preferred embodiment of this inventton, if the number of sbts m a fbid are too numerous to feasibly 

25 use a sbgb-byte code, but a rebthfely small number of slots contab most of the records b a field, then an btegral- 
byte code usbg a combmattan of sbgle-byte and varbble-byte codes can be used. This "combbation-code** encodes 
the largest dots usbg one byte and the remainmg sbts using multiple bytes. Thu$, the first single-byte of the 
combbation code represents the 256 brgest slots (Le. numeric equhrabnts 0 through 255 are encoded as 00|b 
through FF^s). The second single-byte of the combbation code represents the next 256 largest slots (i.e. numeric 

30 equhrabnts 256 through 511 are abo encoded as 00,| through FFJ. The third sbgb-byte of the combbation code 
represents the next 256 largest sbts fi-e. numeric equnralents 512 through 767 are also encoded as OOig through 
FFis), and so on. Then, two bytes ere used to represent the next 65,536 brgest sbts. (For exampb, numeric 
equhrabnts 768 through 66,303 are encoded as 0000,e through FFFF,b). Then three bytes are used to represent 
the next 16,774,216 largest slots, and so on. The type byte distlngubhes between the various sbgb bytes (e.g. 

35 the first second and thM smgb-bytes as above) and also indicates use of the multiple-bytes. As the name suggests. 
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the combination code faDs between the single-byte code and the variable-byte code m achieving a tradeoff between 
reduced storage requirements and increased storage complexity. 

As shown in R6. 6, the s'mgie-byte code is used if the field's translation table contams a maximum 
number of entries, hi a preferred embodiment, four singte*bytes are used to represent the numeric equivalents 0-255, 
5 258-511, 512767 and 768-1023. The single-byte threshold b chosen to be: - 973. This vahie is the point 
where the last smgle-byte b 80% full allowing further entries to the translation table. Four single-bytes provides 
a reasonable tradeoff between one-byte storage and the number of subpartitions created by four code types for one 
field. One of ordinary skiD in the art will recognize that more or less than four single-bytes can be used, and other 
values of can be specified to achieve other system tradeoffs. 

10 If the single-byte code threshold is exceeded, a combination code might be used. As shown in RG. 6, the 

combination code is used if the number of records contained in the slots assigned to translation table entries 256- 
511 exceeds the combmation code threshold, C^, which is a speciTffid fraction of the number of detabase records. 
In a preferred embodiment, two one-byte combination codes are used to represent numeric equivalents m the ranges 
0-255 and 256-511, a two-byte code represents the numeric equhralents 512-66047, a three-byte code represents 

15 the numeric equivalents from 66048-16843263. The combination code threshoM is chosen to be: Cts-0.30. This 
achieves a reasonable tradeoff between the number of slots represented by single bytes and the number of 
subpartitions created by four code types. One of ordinary skill in the art wil recognce that other values of Cis can 
be specified or more than four combination-code types can be established with multiple threshold criteria tb achieve 
other system tradeoffs. K neither the single-byte code nor the combination-code thresholds are satisfied, the default 

20 variable-byte code is used as noted above. 

Referring to FIG. 19, the code determ'mation process 474 tests the characteristics of a field's translation 
table and marks the field for either single-byte codes, combination-codes or variabb-byte codes. This, process begins 
by comparing the number of translation table entries with the single-byte code threshold 476. If the number of 
entries is greater than this maximum, the single-byte code is rejected and the combination code is tested 480. For 

25 the combhiatnm code, the number of records in slots whose numeric equhralents range from 256 to 51 1 is compared 
with the conAmation-code minimum threshold. If the combination code threshold is not satisfied, then the 
combmation code b rqected for thb field and the field b mariced for using a variabte-byte code 482. Otherwbe. 
if the singb-byte code threshold b not exceeded, the f bid b marked for the single-byte code 484. If the singb-byte 
code threshoU b exceeded, but the combmation-code threshold b met, then the field b marked for the combination 

30 code 486. When the field has been marked for one of the three codes, the code determination process b comptete 
490. 

Referring to HG. 20, the code generation process 492 takes a numeric equhrabnt from a transbtion tabb 
and converts it to en integral^yte-bngth binary code, which b then mserted m a compacted record. First, the code 
to be used for thb field b determined 494. The f»ld was mariced for a parttcubr code during the code 
35 determination process shown in FIG. 19. The number to be encoded, N, b specified 496. N b a numeric equbabnt 
read from the field's transbtion table. It b determined if the field b marked for a single-byte code 500. If so, the 
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single-byte generated is N modulo 256 502. The storage code is also generated B04, which specifies the parttciriar 
stngle4iyte code used. The stngle-byte code used is bit [Nf256L where intD b the smallest mteger greater than the 
number in brackets Q. Altemathrety, if the field was marlced for a combination code 506, then the combination code 
generated is a function of N 508. If N < 256, then a one4iyte code equal to N is generated 510, end the storage 
5 code - 1. If 255 < N < 512, a one-byte code equal to (N - 256) is generated 512, and the storage code equals 
'combmation 1." If N > 511, then a mul^le-byte code equal to (N • 512) is generated 514 using the minfanum 
number of bytes. The storage code b set to "combination 2' if N < (65536 -i- 512) - 66048, otherwise the 
storage code is set to "conAination 3." As a defauitr a variable-byte code equal to N is generated 518, where the 
storage code is set to the minimum number of integral bytes required to represent N. After e code is generated for 

ION, the process is finished 520. 

The data vahie for each field of a compacted record may vary as to how the data vahie is compacted or 
"packed." That is, the data value variables are the specific patterns suppressed, the type of code generated 
(variable-byte, single-byte or comb'mathin code) and the number of bytes generated. A temporary pack biformatmn 
tabte is formed during creation of the compacted records in order to keep track of these variables. The pack 

IS information table consists of a row for each compacted record and a cohmm for each field of the compacted records. 
At the intersection of each row and cohmm is a one^iyte entry, called a pack method identifter. The least s^nificant 
half (4-bit$) of the pack method identifier is the storage method number. The most significant half (4-bits) of the 
pack method identiTier is the pattern number. The storage method numbers, as defined for a particular embodiment 
of this invention, are shown in HG. 23IUL The pattern numbers are obtamed from a pattern definition table for a 

20 particular fieU. See, for example, FIG. 23L A pattern number of zero specifies that no pattern was suppressed from 
this data vahie. 

For example, suppose the fifth row and fourth cohmm of the pack information table contamed the pack 
method identiTier 82|9. This would indicate that the fourth field of the fifth compacted record contams a compacted 
data value from which was suppressed pattern number 8. Further, the storage method is Z This anplies that the 

25 fourth field has numeric substitution eppfied to it because patterns are not applied to fields with translation tables. 
Therefore, the mmieric portion of this field is a value between 256 and 65,791, which can be determined by adding 
256 to the compacted data vahie. The original uncompacted data value could then be reconstructed by converting 
the numeric portion of this field from binary to character representation and tesertmg pattern number 8 from the 
pattern table into this character representatnn. 

30 With four bits available for the pattern identifier, the number of possible patterns suppressed in any 

particular field b Imited to 15. If more than 15 patterns per field for a specific appricathm were to be required, 
another byte couU be added as part of the pack method identifier. The temporary pack information table b used 
to buBd the permanent record information table, which abo specHbs how each data value in each field of a 
compacted record b "packed" (see below). 

35 The partitioning process creates the database image. The partitionfaig process groups the compacted records 

by length (the number of bytes of storage required by a compacted record). Equal-length compacted records are 
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placed in contiguous storage areas, caDed partitions. In a preferred embodiment, each partition contains up to 
16,777,216 compacted records. Each partition is assigned a partition number which b associated widi the 
compacted records contained m tiie partition. Further, a type byte is appended to each compacted record which 
defines how each fieU of the compacted record was "paclced." The result is a database bnage. Construction of 
5 the database anage is detailed in the f oBowing discussion. 

Central to the partitioning process is the record mf ormation table, which is constructed using the temporary 
pack information table. The pack mformation table b created concurrently with the compacted records and b 
described above in the dbcussion of code generation. RnaD that there b one row of the pack information table 
for each compacted record, and that there b one cohmm of the pack informatnn table for each field of the 

10 compacted records. Abo recall that at the intersection of each row and column b a pack metitod identifier uidicating 
the pattern suppressed from the compacted data value and the code used to encode the numeric equhralent of the 
corresponding data value. Each different row of the pack information table b used to form one column of the record 
information table. Thu$, there are fewer record information table cohimns than pack information table rows. 
Therefore, each row of the record information table corresponds to each fieU of the compacted record;, and each 

IS cohmm of the record information table corresponds to a "type" of compacted record. A type of compacted record 
b defined by the pack method identifiers contained in the record mformation table cohimn. That b, all compacted 
records whose fieUs have the same patterns suppressed, the same codes appibd (variable-byte, single-byte or 
combmatron) and the same resulting code bngths generated (one-bytOr two-byte, etc.) wiD be represented by a singte 
cohmm of the record mformation table. After the record inf ormatton table b created* there b no hinger a need for 

20 the pack inf ormation tabte. 

Due to the potential size of the pack information table, in a preferred embodiment, only one row of the pack 
information tabte b mamtained by the build subsystem. After a row b buit for a particubr record, the record 
information tabh b checked for a column that matches thb row. If a matching cohmm b not found, a new record 
mformation tabte cohmm b created. The pack mformation tabte row b then dbcarded. Thb procedure b repeated 

25 for every record m the database. 

Each cohimn of the record nif ormation tabte b assigned a unique two-byte vahie, the most significant byte 
b a partitten number and the teast spificant byte b the type byte number. AD compacted records of the same 
tength (numb^ of bytes of storage required) are stored te the same partition (unbss that partition already contains 
16,777,216 compacted records in which case another partitten for compacted records of that length b created). 

30 The designated tength of a parthion b the tength of the compacted records stored in the partition. Therefore, aB 
cotemns of the record mformation tabte corresponiBng to compacted records of a particular tength are assigned the 
partition number of a partition of that designated length. 

As previously stated, each cohimn of the record information tabte, besides being assigned a partitten number, 
b also assigned a type byte number. The type byte, referenced above in the dbcussten on code generation, b an 

35 overhead byte appended to each compacted record. Each cobmn of the record information tabte assigned a particubr 
partition number can be assqned one of 256 type byte numbers. That b, there b a unique type byte number 
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assigned to each column with the same partition number. The type byte number designates the type of "packing" 
specrTied by the record information table column to which it is assigned. The type byte of a compacted record has 
the value of the type byte number of the corresponding record information table cohmm. The hcation of the type 
byte in the compacted record is not critical The type byte may be located at tfie beginnbig, tiie end or anywhere 

5 in the middle of the compacted record. For ease of description, the type byte will be referred to es being appended 
to the end of a compacted record. 

An example record mformation table b shown in FIG. 23J. The columns 0E17ts and OEISie designate tiiat 
compacted records stored in partition number 14 witii type byte values 23 and 24 (17i8 and 1B|J are packed 
according to the two columns shown. Specifically, freld 2 of any compacted record in partition number 14 with type 

10 byte 24 has pattern number 11 (Bu) suppressed and the numeric equhratent is encoded using a four-byte variable 
byte code (see entry 4 of FIG. 23M). 

Constructed m this mannerr the record information table contains the information necessary to initiate 
retransiation of aD the Fields in any compacted record. That is, if a contacted record is read from the database 
image, the partition number it is read from and the value of the type byte appended to the record specify a cohmm 

15 of the record information table. That cohmin, in turn, specifies the storage method number and pattern number for 
each fffild within a compacted record. From this information, the uncompacted vahies for all fields may be 
determmed using the appropriate translation and auxiliary tables. 

Because the partition number is represented by one byte, there are 256 possible partitions. The 256 
different values of the type byte form "subparthions" w'lth'm each partition, consisting of compacted records witii 

20 the same type byte value. The largest vahie for the type byte, 255, designates that this particular record has been 
marked for deletion. 

As noted above in the discussions on pattern suppression and on coding, a tradeoff exists between reduced 
database image storage requirements and increased database image storage complexity. Database bnage storage 
can be reduced by relaxing the criteria for identification of patterns or for use of smaller length codes (such as the 
25 single-byte code). Doing so, however, Bicreases the number of possible storage methods and pattern numbers for 
each field. This, in turn, mcreases boti) the number of possible compacted record lengths (hence, the number of 
partitions) and the number of possible types of compacted records (hence, the number of subpartitions). The greater 
the number of partitions and subpartitions, the greater the storage complexity and the slower the access to data 
in the database image. 

30 As shown m FIG. 21, the partitioning process begins with an empty record information table and the 

partition number value initiaSzed to zero 524. A row of the pack information table is read 526. This row is then 
compared with cohimns *m the record mformation table 530. If there is no match, this row becomes a new entry 
in the record information table. This new entry is created by fkst determining if there are any exbtmg partitions 
containing records of equal size to the compacted record 532. If no such partition exists, a new partition is created 

35 with a type byte equal to zero 534. This new partition is designated by the current value of partition number. The 
partition number vahie is then incremented 536. The pack information row is then stored in the record information 
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table at a cohimn designated by the vahie of partition number and the value of type byte 540. If there is an 
existing partition of equal conipacted record size, the highest eiistmg type byte vahie In this partition is determined 
542. The next sequential type byte vahie is determmed 544. K the type byte vahie b equal to 255, the partition 
b fuB and a new partition is created 534. Otherwise, the pack information row is stored as a column m the record 
5 informatron table 540. If the pack mformation row matches an entry m the record information table, that entry 
specifies a correspondhig partition numbor and type byte vahie 550. A type byte equal to the specified type byte 
value is appended to the compacted record which corresponds to the pack faiformation row 552 and that compacted 
record is stored in the next sequential location of the partition equal to the specified partition value 554. This 
process is repeated untH all rows in the pack mformation table are read 558. At this point, the partitioning process 

10 is complete 560 and the compacted records have been formed into a database image. 

As previously discussed with respect to FIG. 5, an index access structure provkies en efficient data access 
means. As described above, the index access structure for the database miage con»sts of an index table containmg 
page pointers and s page table conteining record pointers, for each speciTttd index. Creation of these data structures 
first requires creation of a temporary index array for each indexing field. Each different data vahn in an indecmg 

IS field becomes an sidex vahie. Each temporary index array is a sorted (e.g. alphabetically) fist of index vahies. 
Associated with each mdex vahie is a 4-byte counter whUi specifies the number of occurrences of a particular mdex 
value in the hidex field. After a temporary index array is constructed, it is converted to an mdex table by replacmg 
die occurrence counter with a pointer into a page table. A page table consists of one or more "pages," where each 
page is a portion of memory which contains 65,538 record pointers, end each record pointer b a 4-byte value which 

20 specifies a storage tacation of a compacted record witiim the database image. One byte of each record pointer 
specifies a partition number, and the other three bytes specify the relative record position whhm a partition. The 
record pomters for a particular imiex vahie are contamed in a contiguous area or "block" of the page table, but tius 
block of record pointers may span several pages. Initially, each page b only partblly filled witii record pointers in 
order to aibw room for data expansion. As shown in FIG. 6, tiie number of pages in the page table - intHnumber 

25 of records in the database) / (65,536 « C,b)L where C,| b the niitial page fil ratio and mtQ b tiie smaDest integer 
greater than the number in brockets Q. For a preferred embodknent Cis - JO. That b, 80% of each page of the 
page table b initially faied with data pointers. The page pomters hi tiie bidex array are four bytes. One byte 
specifbs a page mmiber and two bytes specify the beginning of a btock of record pointers. The remaining byte b 
reserved for future use. 

30 FIG. 22 shows the detailed constnictkm of the mdex access structure 582. Hrst a temporary index array 

b created for each field specified as an mdexmg fieU. To do so, the database records are read 564, and an mdexed 
field data vahie findex value) b setected 566. For each new index vahie encountered 570, an entry m thb field's 
temporary index array b created, with the associated counter set to one 572. If an index value b encountered 
which b afready in the array, the associated counter for that mdex vahie b mcremented by one 574. If all indexed 

35 fields have not been processed 578, another index vahie b selected from tiie record 566. When aB records have 
been read and processed in thb manner 580, aB temporary index arrays are sorted by index value 582. 
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Continuing with FIG. 22, the next step is to convert each temporary index array to an index table. To do 
thi$r a total record count is mittaDzed to zero 5B4. Beginning with the first value m the temporary index array, the 
associated counter is added to the total record count 5BB. An access page and page entry number associated with 
this index value b computed 530. The computations are: access page • (total record count)/(65,53B x CtsI and 
5 page entry number » (total record count -1} modulo IB5336 x CsL where C,, is the initial page fiH ratio. A page 
pointer is created from the computed access page and page entry numbers 592. An index table entry is then created 
by replacmg the counter in the temporary bidex array with the page po'mter 594. If other entries in the temporary 
index array remain to be processed 59B, the next array entry b sehcted BOO and the process b repeated 5B6. The 
result b a tabte of mdex vataesr each with pointers to an empty bbck in the page table. H other temporary index 
10 arrays remain to be converted to index tabbs 602, another array b selected 604 and processed in the same manner 
5B4. 

Continuing with FIG. 22, eech block of a page tabb must next be f ilbd with record pobiters, which indicate 
the location of the compacted records contanring a specific index vahie. A database bnage record b read BOB, and 
a record pointer b constructed using the partitbn mmtber and rebthre record bcatnn of the read record 610. Next, 

15 an index value b sebcted from the read record 612, and a page pomter corresponding to the index value b read 
from the index table 614. The page entry number in the page pointer b decremented BIB. If the page entry 
number in the page pointer b negative one 620, the page number b decremented 622 and the page entry number 
b reset to the bottom of the prevbus page 624 (page entry number « 65,536 x Ce • 1). The record pomter b 
then stored m the page tabb at the location specifbd by the page pointer 626: h thb manner, each bbck of the 

20 page tabb b fiDed, from the bottom up. If other indexes remain to be processed 630, the next index field m the 
record b selected 612. When al records are processed 632, the index access structure constructbn process b 
complete 634. 

After completbn of the mdex access structure, computer system storege contains the data and tables 
shown in FIG. 23A. The trensbtbn tabbs 640, dbcussed m dete9 ebove, bchide the s'mgle-rield 642, multipb«field 

25 644, partbl 645, text phrase 647 and text deGmiter 649 translation tables. Each translation tabb contains a series 
of "slots" and a corresponding series of "numeric equhrabnts." There b one slot for every diff ermit data value in 
a singb field translation tabb 642; every different data vabe combination m a multqile-field transbtbn tabb 644; 
every different data value translated m a partbl transbtion tabb 645; every different word or phrase m a text 
phrase translation table 647; or every different delimiter in a deGmiter transbtbn tabb 649. hi each transbtion 

30 tabb, the slots are ordered based on slot size, i.e. the number of records in the slot For exampb, the number of 
occurrences of a particular data vahie in a smgb field b the size of the slot for that data value. The corresponding 
numeric equivalents for the slots are consecutive integers, starting with 0, that represent the rank of the stots. That 
b, the brgest slot has tiie numeric equivalent 0 and the next brgest slot has the numeric equivabnt 1, etc Each 
data value (data value combination, word, phrase and dersniter) in a slot b assigned the same numeric equhrabnt, 

35 and when that numeric equhrabnt b converted to a code, the code becomes a compacted data vabe. 
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RG. Z3B is an example singb-field translatnn tabte showing the sbts and the corresponding numeric 
equhralents. "VaUey" is the largest slot because it is assigned a numeric equivalent of 0. Hence, the data value 
"YaDey" occurs more frequently in this field than any other data vahie. Regardless of the method used to encode 
the numeric equivalents (variable-byte, single-byte or combination codes), the compacted data vahie for "VaDey" will 
5 be the single byte 00n» (See FIG. 23IVI, row number U Theref ore, for every record where "Valley" appears in this 
field, the corresponding compacted record wil contain the compacted data vahie OOig. 

HG. 23C shows an example multiple-field translation table. The largest slot ni this combmation of two 
fields is "Smith 8176B." As above, regardless of how the numeric equhralents are encoded, the compacted data 
value for this slot will be 00,b. Therefore, for every record where "Sndth" is in one field and "91766" is in the 
10 other, the corresponding compacted record will have a single fieU vi/ith the compacted data value 00^ 

FIG. 23D shows an example partial translation table. This table contams only numeric equhralents for the 
258 largest slots. AO other slots are unencoded. The numeric equhralents can be encoded with one byte. That is, 
the compacted data vahies win be 00,6 'or "Ship," 01 ,s 'or "InstaQ," ... and FFie for "Update." For every record 
in the original database containing a data vahie ftat b in one of the slots shown in the table, the corresponding 
15 compacted record will contain one of these compacted data vaktes. 

R6. 23E shows an exampb text phrase translation table and FIG. 23F shows a corresponding example text 
defimiter translation table. The slots in the phrase table are words such as "Radio" or phrases such as "Control 
Console." The slots in the deHmiter table are " ", "•" and ";T etc If the text fieU m one record contamed the data 
value "radio network antenna-system manuals," then the numeric equivalents for the phrases are 10, 3 and 7. The 
20 numeric equhralents for the dernniters are 0, 2 and 0. Using the coding of HG. 18B, the correspondbig sequence 
of numbers representing this data vahie is 2A238227,b. Usmg the table of HG. 18B to mterpret thb sequence, the 
"2,b" of "2A,8" indicates that thb b a word from the text translation tabte and die remaining 44iits "A,b" are the 
numeric equivalent (in thb case 10, corresponding to the phrase "radio network"). Likewise "23,8" indicates a word 
witii a numerb eqinvalent of 3, corresponding to "antenna." The delimiter " " between "radio network" and 
25 "antenna" b impGed m the sequence of numbers, because " " b the most prevalent delimiter (the largest slot) in thb 
case. The next number in the sequence b "82,b" which, according to HG. 18B, b a deloniter wfth a numeric 
equhratent of 2, which b "•" according to FIG. 23F. The final number in the sequence b "27" representing a word 
with numeric equivalent 7, which b "manuab" according to FIG. 23L 

The Mi combination table 646 indicates which fields were combmed and which fields are part of each 
30 combination. FIG. 236 b an example field combination table. Of the fields shown, the tabte mdicates that fields 
0 and 4 are not combined with other fields and that fields 1, 2 and 3 are combined togetfier. 

The mostiy numeric fields table 650 indicates which fields were subject to numeric convernon. FE 
23H IS an example mostly numeric tabte. Of the fields shown, the tabte indicates that fblds 0 and 2 are mostiy 
numeric 

35 The pattern tables 652 contain the patterns removed from each fteld, arranged according to the four bit 

value of the correspond'mg pack information table entry. FIG. 231 b an exampte pattern definition table. For 
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this Mi, there are five patterns occurring m the first three positions of the fieU. These patterns are 'ARO/ 'BRO/ 
"Bjr "JO" and "0_.- 

Stored on disk, or in main memory if sufficient space exists, is the database onage 660. The database 
image consists of partitions of storage, where each partition contains compacted records of equal length (number 
5 of bits of required storage). 

Also stored on disk, or potentially m main memory, is the record Information table 662. As discussed in 
dotal abov^ ghren the partition number of a compacted record in the database image and the value of the type byte 
appended to the compacted record, the record mformation table identifies the pack method for each field of the 
compacted record. That is, the record information table provides toformathm regarding the pattern number and 

10 storage method number for each field of the compacted record. FIG. 23J is an example record information table. 
This table shows that all compacted records contained in partition 0 with a type byte vahie of 1 are stored as 
f oDows: field 0 has no pattern and a four-byte variable byte code which represents a numeric equwalent whose 
value IS between 16843007 and 4311810304; fields 1, 17 and 24 have no pattern and a single byte numeric 
equwaient whose value is between 0 and 255; field 2 bas a suppressed pattern number 5 and a single byte numeric 

15 equwalent between 0 and 255; field 26 has no pattern and is stored in original form. 

For efficient access, there is also an index table 664 and a page table 666 for each indexmg field. An 
example index table is shown in R6. 23K, and a corresponding example page table is shown m FIG. 23L As 
discussed above, the index table consists of an unpaclced value for the indexmg field and a 32-bit page po'mter into 
the page table. The index table is sorted alpbabetically by this unpacked value. The page table consists of mult^le 

20 pages each of which are assigned a page number. Each page of the page table initially contains 52,428 record 
pomters (65,536 record pointer capacity x 80% initial fiH ratio). In a particular embodiment the fffst 8-bits of a 
page pointer correspond to the page number, the next lO-bits correspond to a record pointer number on that page 
and the last eight bits are reserved for future use. 

As shown in the example index table of FIG. 23K, the first entry is "Central" Because "Central" is the 

25 first mdex value, its page pointer is OOOODOOOtsr corresponding to page number 0 (00,8) and record pointer number 
0 (OOOQie) within page 0. The second entry of the index table is the index value "Chestnut." Assuming that there 
are 257 records in the database containing the rndex value "Central," the corresponding record pomters occupy the 
first 257 bcations of page 0 (numbers 0-256). Hence, the page pointer of "Chestnut" b OD010100,er correspondmg 
to page number 0 (00,^) and record pointer number 257 (0101,b) within page 0. Assummg that there are 52,334 

30 records m the database containing the index value "Chestnut,* the correspondmg record pointers occupy the remaming 
52,171 locations of page 0 (52,428 initial capacity • 257 locations associated with "CentraD and tiie first 163 
bcations of page 1 (numbers 0-162). Therefore, the third index value, "Frgueroa," starts on the second page (01 J 
of the page table, at location 163 (00A3,8), and its page pointer, therefore, is 0100A300|8. 

FIG. 23L shows the beginning of an example page of the page table. The first entry is "52428" 

35 corresponding to the number of entries on the page. This is followed by that number of record pointers. These 
record pointers correspond to the records in tiie database. The first 8-bits of a record pointer specify the partition 
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number of a record in the database. The remaining 24-bits specify the a record's position in that partition. For 
example, the third record pointer shown in FIG. 23L is 10000020,9. This specifies partition number IB (10,8) and 
record 33 (000020,,) of that partition. 

Referring back to FIB. 4, there are two operational subsystems for a preferred embodiment of this database 
5 system, a build subsystem 10 and an access subsystem 40. The bufld subsystem 10 converts a conventional 
database 30 residing in mass storage 14 into a database image 12, translation tables 20 (single-fieid, multiple-field, 
partial text phrase and text deMer tables), auxiGary tables 22 (field combination, mostly numeric, pattern and 
record information tablesi and access tables 24 Ondex and page access tables). A preferred embodiment of the build 
subsystem accompRshes this task h five passes, where each pass invohres reading through the entire conventional 

10 database 30 either by reading each record (an unsampled pass) or skippoig records using a sampling technique as 
descr&ed above (a sampled pass). The build subsystem constructs the database imege and al tables in mass storage 
14. The access subsystem 40 then loads the translation 20 and auxiliary 22 and access tables 24 in RAM IB prior 
to use of the database. Based on how much RAM 16 is in a particular system, the access subsystem 40 wHI leave 
the database image entirely m mass storage 14 or may load part of the database image m RAM IB. A temporary 

15 "field information array" is used to store mformation regarding the properties of each fnU. A field is marked for 
a particular compaction method by storing status data in the field mformation array. The field information array also 
has pointers (storage addresses) to the locations withm storage (mass storage or RAM) where the database image, 
translatfain tables, auxiliary tables and access tables were stored by the build subsystem. 

Advantageously, the first pass through the database is an unsampled pass. Referring to FIG. 23A, this pass 

20 creates the single fieU translation tables B42, the single-column pattern portion of the pattern tables B52 and the 
mostly numeric table 650. During this pass, fields are tested and marked (via the temporary fieU information array 
descrbed above) for smgle*field encoding, pattern suppression, numeric substitutkm and text compression; tiie 
different data vahies for fields marked for ^ngh-field encoding are determined and the dngle-fieU translation tables 
are constructed; the single-position patterns are identified and stored in the pattern tables; and die temporary index 

25 arrays are constructed, which will be used to create the index tables 664 during tiie last pass. 

If a text fieU was detected during the first pas«, a second unsampled pass will be used to identify the 
different text words and deSmiters and create tiie text debinter table 649. 

The third pass is a sampled pass which creates the partial transhtion tables 645, the fieU combinathm 
table 646, the remaining portion of the pattern tables 652 containing multiple-position patterns, and tiie text phrase 

30 translation tables 647. Database sampling is described in detaO above witii respect to tiie muftiple-fieU combming 
compaction method. Durmg this pass, fields are tested and mariced for multiple-field combining and partial 
translation; tiie different data values for fields mariced for partial translation are detemuned and tiie partial translation 
tables 645 are constructed; muhiple-positioo patterns are identified, using the sbigte-position patterns from the f tet 
pass, and stored in the pattern tables 652; text phrases are determined udng the text words identified during tiie 

35 first pass; the text words and phrases are encoded; and the text phrase tables 647 constructed. 
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The fourth pass is unsatnpled and creates the multipb-field translation tables 644. During this pas$, the 
different data value combmations from the fmlds marked for combining are identified and encoded. 

The fifth pass is the final pass and is used to create the database image 660, the record niformation table 
662, the page access tables 666 and the nidex tables 664. During the fifth pass, the compacted records are 
5 created in accordance whh the method outfined in above with respect to FIG. 16. Also, the temporary pack 
information table is generated, mdicating the patterns and storage mediod for each field of each record. The record 
information table 662 is created from tiie temporary pack information table during this pass. Each different row 
of the pack infomiation table creates a cohmin in the record information table and each such cohimn is assped a 
partition number and type byte. After the compacted records are created, the compacted records are grouped in 

10 partitions, and a type byte is appended to tiie compacted records. With this last step, the database image 860 b 
complete. The locations of compacted records within the parthions of the datebase image are used to create the 
page access tables 666, and these tables are used with the temporary index array created during the first pass to 
create the index tables 684. At this point the database buid subsystem is finished. 

Referring to FIG. 4, the access sidrsystem 40 operates to enable use of tiie database structure created by 

15 the bu3d subsystem 10. biitially, the access subsystem 40 takes the database image and tables, which at this point 
stiS reside m mass storage 14, and places the tables in RAM 18 and the database image 12 in either mass storage 
14 or in RAM 16 or spGt between botiu Dnce in tiib conf^uration, the database structure can be accessed. RG. 
23A shows the interrelationshv of the database bnage 660 and various tables during operatnn of the access 
subsystem. The data structures of the database m accordance witii the present invention are shown in FIG. 23A 

20 as boxes. The information f tow between data structures, that b, tfie data obtained from one data structure which 
b used to access data in another data structure, b shown in FIG. 23A as connecting arrows between boxes. To 
access data, an index vahe (a data vahie in an mdex field) specified by a user b used by an index tabte 664 to 
specify a page pointer 665. The page pointer 665, in turn, b used by a page table 666 to specify a block of record 
pointers 670. These record pointers 670 specify tocations of compacted records witiiin the database image 680. 

25 Compacted data vetoes 672 from the compacted records are used by the transtotton tables 640 to obtain equbatent 
data vetoes. The number of the partition containtog a conqiacted record and the value of the compacted record's 
type byte 674 specify a column of the record information tabb 682 whbh contains pattern and storage method 
numbers for each fbld. The rows of the record mfonnation tabb 662, specify field numbers 676, which are used 
by the mostly numerto fieU tabb 650 and the field combinatton tabb 646. If a particular fbld b a mostly numeric 

30 field or b combined witii other fblds, its fbU number will be found m tiiese tabbs. In turn, tiiese tables notify the 
database access system that the compacted data values 672 are ehher numeric values or represent codes assocbted 
with multipb fields m the or^mal database. Pattern numbers 680 from the record mformation tabb 662 are used 
by the pattern tabb 652 to specify particular patterns which are to be replaced to compacted data vetoes to rweate 
the ofqinal data vetoes. Storage metiiod numbers 636 from the record mformation tabb 662 are used witii the 

35 translation tables 640 to recreate data vatoes from compacted data values. The storage metiiod numbers 636, 



wo 95/32477 PCT/US95/03901 

•34. 

specify whether Fields are partial or fuDy Uanslated, whether fields are encoded with variable-byte, single-byte or 
combination codes, and the particular varbble-byte, single-byte or combination byte used. 

FIG. 24 iDustrates the process for accessmg data in the database image 700. The access begins with a 
request for data from the user of the database. The user request contains specified fmlds and data vahies. These 
5 are determined from the user r«|uest 701 If the request is for a read, delete or update 704, a search for records 
satisfying the request wBI occur. RecaD that an index fidd, fke an index for a textbook, provides an efficient means 
of accessing data. The user request might not specify an Index field, might specify a single index field or might 
specify multiple mdex fields (if more than one index field was created during the bm1d process). If an mdexfaig f»td 
B not specified by the user 70S, the data values specified must first be compacted 708 to the equwalent 

10 representations in the database image. Then the database image can be searched 710 for records satbfying the 
user request If at least one index field Is specified 708, aB data values specified by the user whhm an index field 
(called mdex values) are identified and the index vahie to be used is determined 707. As discussed above using FIG. 
1 A as an example, the "Last Name" field was made an bidex field. If a user requested data on "people whose last 
names are "Grant," then the "Last Name" field is a specified index feld and "Grant" is a specified index value. 

15 The mdex vahie used during en access b the one which will most Irnitt the search, in the case where only 

one index value is specified by the user, that index vahie is selected, bi the case where more than one index vahies 
are specified, the index which results hi the least number of records to be searched is selected. As previously 
discussed, an nidex value is contained in an index tebh with a corresponding page pointer. A page pomter specifies 
a block of record pomters that span one or more pages of at most 65,536 record pomters each. The posftion of 

20 a record pointer within a page is called its page entry number. The number of records to be searched is determined 
by the "distance" between the page posittons specified by two page pointers, the page posit»n corresponding to a 
specified index vahie and the page position corresponding to the next index vahie m the index table for that mdex 
field. The difference between two page pomters is a function of the pages they pohit to. if they point to the same 
page, the difference is the difference 01 their page entry numbers, if they do not pohit to the same page, the 

25 ififference b the total number of entries on the page pointed to by the first vahie, less the first vahie's page entry 
number added to the sum of the second vahe's page entry number and the number of entries in all intervenmg pages. 
In the special case where the last index vahie is soppGed, a "dummy" page pointer is created, pointing to one more 
than the last page entry. Tlus "dummy" entry is used as the second page pointer for number-of-record computations. 
Referring agam to H6. 24, aB fields supplied other than the selected mdex are compacted 709. The 

30 database bnage search 710 can then be accomplished usnig the selected index table directly without compacting the 
specified index vahie. If a read, delete or update were not requested 704, then a data msert is performed 712. 
If the requested access fails 716 (eg. the data requested does not exist) an error message b returned to the user 
720. A successful access b acknowledged 722. If a data read b requested and records are located 724, the 
records are transmitted to the user 726 and the access b complete 730. 

35 One type of read request is: "faid aB records in a fieU of the database with a data value equal to the one 

suppSed." That b, a field and data vahie are specified. As shown m FIG. 25, thb type of data read request begins 
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by determining the form of the specified data value as it is stored in the database image 732. If the specified field 
has a translation table 734, the specified data value is replaced with its numeric equ'nralent 736. If the field is part 
of a multiple field combination 740, the data value is replaced by a set of one or more numeric equivalents 742, 
where each numeric equhraknt corresponds to a different data value combinetion containing the specified data value 
5 in the specrTied field. If there is no numeric equivalent m a translation table associated with a field to the value 
supplied for that field, the database is not read and a "no record found" status message is sent to the user. If the 
field has a pattern nif ormation table 744 and the data vahie has a pattern 746, the pattern b removed from the 
data value 750 and the pattern number is flagged for use in pacic information identification 752. If the field is in 
the mostly numeric table 754 and tiie data value b a numeric 756, it is converted to a binary number 760. If the 

10 field has a partial translation tabb 762 and the data value b one'of the 256 entries in thb tabta 764, the data 
value b repbced with its numeric equhratant 766 and a ftag b set 770 that mdicates a partial transbtion was used, 
for pacIc informatbn identification. K the field b a text field 772, the sequence of numbers representing the text 
data value b determined 774. At thb point, the specified data value has been converted to its compacted form 776. 
Then the entire database nnage cen be searched for compacted data vahies matching the compacted data value or 

15 set of compacted data vahies calculated above. 

As shown bi RG. 26, a comptote database search 780 begins by deterroming if an indexng field b specifbd 
782. If not, the current record and partition number are mitblzed to zero 784. A compacted record b retrieved 
from the current partition/record tocation specified 786. The type byte appended to the record b read 790 and used 
to retrieve a cohnnn from the record information tabta 792. If the type byte b equal to 255, the record b ignored. 

20 The record faifonnation tabb cohnnn bdbates the tocatbn withbi the compacted record of the field of mterest 784. 
The compacted data vabe b retrbved and compared with the compacted data vahie computed above 796. If there 
b a match, the compacted record b processed 800. If no match, the current record number b mcremented 802. 
If the current record number b greater than the number of records in the current partition 804, the current record 
b set to zero 806 and the cinrent partition b mcremented by one 810. If the current partitbn b greater than the 

25 number of partitbns 812, the end of the database image b reached and the data search b concluded 814. 
Otherwbe, the next record b read 786. 

If an mdex field b specified for the data access 782, it b unnecessary to first convert the requested data 
value to compacted form, because the index table contains sorted data vahies in uncompacted form. The data 
access occurs by locating the specifbd data value in the index tabb 816. If the suppfied value b not found in the 

30 index table, the database b not read and a "no record found' status message b sent to the user. The page pobiter 
b read 820, whbh specifbs the begmnmg of a block of record pointers contabing thb mdex value. Thb fabck of 
record pointers b then used to access compacted records satisfying the data request A record pointer b read from 
the page btock 822. The record pointed to b read 824, the record b processed 826 and the page pointer b 
incremented 830 to the next record pointer in the page block. The page pointer b tested for an end of bbck 

35 condition 832. An end of bbck condition exists when the page pointer b equal to the next mdex value's page 
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pointer or the end of the database is reached. If an end of block condition is not reached, the next record printer 
b read 822. Otherwise, the access is complete 814. 

FIG. 27 shows the record processing 838 required for reads, deletes and updates. For reads 840, a record 
located during the database image search is stored for transirassion to the user 842, compteting the record 
5 processing 848. 

Other types of data reads m possible using the same methods. For instance, the read request "find aO 
records in the database bgically related to the specified data value' (eg. greater than) or "find aD records in the 
database satisfying a partial field description" (eg. all data vahies in the "Name" field starting with "S"). One of 
ordinary skill in the art will recognize how to perf omi other types of data read accesses, gnren the structure of the 

10 database image and associated tables. 

One type of data delete request is: "delete each record from the database in which the fields specified 
contain the data vahies specified." As shown in FIG. 24, the data delete is similar to the data read descrbed above, 
in that a record or records matching a specified field and data value must be found. If one or more mdexing fields 
are specified 708, the mdex vahie used to limit the search is determined 707, all other supplied values are compacted 

15 708, and the database image is searched for the records to delete 710. Otherwise, the specified data value must 
first be converted to compacted form 708. The record processing 836, once a record is found, however, is different. 
As shown in HG. 27, if the access is a data delete 844, records accessed have the appended type byte set to ^5 
to mdicate a deletion 846, complefmg the record processing 848. 

A data update changes one or more field vahies h an existbig record or records. An update request 

20 consists of data describmg the record to be updated (record constraints) and new vahies for specified fnlds. As 
shown m H6. 24, the data update is sbnaar to the data read and data delete described above, in that a record or 
records matchmg a specified field and data vahie must be located, if one or more indexing fields are specified 706, 
the index value used to Ttmit the search is determined 707, afl other supplied vahies are compacted 708, and the 
database image is searched for the records to update 710. Otherwise, the specified data vahie must first be 

25 converted to compacted form 708. As shown in FIG. 27, however, the record processmg 836 once a record is found 
is different for an update 848 from a read 842 or delete 846. 

As shown in FIG. 28, processing a record for a data update 850 begins by selecting a field within the 
record 852. Next, it is detemuned if the field selected is one of those specified to be updated 854. If not, the data 
value m the field is unchanged and directly inserted mto a temporary record 856. if the field is specified for 

30 updating, it is detemuned if the fieM is in a multiple-field comb'mation 860. For fieUs which are not contained withm 
multiple field combmations, the new data value gnren for the field is compacted 852 and inserted in the temporary 
record 856. If the field is contamed in a multiple field combination, the data vahies in the record are uncompacted 
864 and the resultmg data vahie combmation is modified with the new data value or values 866. This updated data 
value combination is then recompacted 870 and inserted in the temporary record 856. The partition number and 

35 type byte for the temporary record are then derived 872. K the partition number is the same as tiiat of the existing 
record 874, the existing record is overwritten by tiie temporary record and its type byte 876. If not, the existing 
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record's type byte b set to 255 880 to indicate its deletion, and a data insert b performed usmg the temporary 
record 882. The access tables for aB indexes are then updated 884. 

A data insert adds a complete new record to the database. As shown in FIG. 29, a data insert 900 begins 
by reading the record to insert 902. A data read is then performed 904. If the record already exists 906 and the 
5 rules associated with this database do not permit dupEcate records 910, a "duplicate record" status message is 
returned to the user 912 and the process is terminated 932. H no duplicate exists, it is determined, for every f»ld 
with a translation table 914, if the data value to insert is missing from the corresponding translation table 916. 
If the data value is not found, the value is added to the end of the translation table witii a count of one 920. The 
numeric equhralent of this data vahie is the numeric equivalent of the previous entry in the translatmn table plus one. 

10 An fields vifitii translation tables are checked in tiiis manner 922. The record is then packed 924 and placed in the 
proper partition 926. This step might create a new column in the record mformation table. If the partition 
determmed from the record information table has any deleted records, the compacted record b copbd to that 
bcation. Otherwb^ the compacted record b copbd to the end of the approprbte partitbn and the partition record 
count b incrementel Fmally, the access tables for al nifices are updated 930 and the process b terminated 932. 

15 An appGcation may require access to the database from a remote locationr that b, a bcation different from 

the host conyuter where the database data b normally stored and processed. Thb eccess b accompished by biking 
a remote device to the host computer m a convnunbations channel Thb communeations channel may be 
"wirebss," such as a radb Bnk; or it may be a conventional telephone line or similar "wired" Bnk. 

Remote access usmg a database in accordance with the present inventbn requires less total data to be 

20 transmitted across the communications channel (reduced communications channel traffic) as compared with using a 
conventional database. Conventional database data b transmitted in an uncompacted form, although such data might 
be compressed at the transmitting end of the communications channel and decompressed at the recemng end. 
Database image data, however, can be transnntted between the host computer and the remote device in compacted 
form. This compacted data can abo be compressed at tiie transmitting end of the channel and decompressed at 

25 the recehring end by conventbnal techniques. Reduced data flow across the communications channel b achbved 
by transmitting a "code" (a compacted data value) in place of the data value. Thb requires the host computer and 
the remote devbe both to "know" the code, as defined by the translation and auxBiary tables. 

In order to achbve ef fttbnt remote access of the database image, as noted above, the remote device must 
be capable of transmitting and recehring database data m compacted form. The remote device must also be capable 

30 of translating the data to and from its original f omi and its compacted form for interfacing with a remote user. The 
remote device, therefore, must store the transbtion and auxiliary tables, which are necessary to compact and 
uncompact the data. Further, the remote device must perform the associated computatbnal functions on these 
tabbs. Referring to FIG. 4, the remote device must have RAM 16 in which the translation tables 20 and the 
auxBiary tables 22 are storel The remote device must also have a CPU 46 and an associated operating system 

35 50 capabb of executing the functions of the database image access processor 42 which operate on the translatbn 
tables 20 and auxifiary tables 22. The remote device must abo have input and output devices 36 for the remote 
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user mterf ace. In cases where a remote device performs operations on a rsmted subset of the data, the remote 
device might only require a partial version of the translation and auxiliary tables. Likewise, different remote devices 
might each contain a different set of partial translation and auxiliary tables. 

Utilizing die Mierent structure of the database accordmg to the present invention allows further reductions 
5 in commuiucations traffic across the conuminicatRins channel compared with the remote access of conventional 
databases, in addition to transmitting the data values contained in a database, remote access also requires the 
transmission of storage format information. That is, the recehnng end (remote device or host computer) needs to 
know what was sent in addition to the vabe sent. Conventional database records contam'mg muhqile fields of data 
would have to be accompanied by infomration effectively describing the record length, the number of fields to the 

10 record and the storage length of each field. This may be accomplished to several ways. For example, each 
transnutted record and each field withto a record could be set to a known, fixed lengtfi by addtog filler (bytes 
contatotog no mformatton). Ahemathrely, each transmitted record could be preceded by a . header specifytog the 
record length, the number of ftUs to the record and the length of each fmli Another possibSty is to transmit 
"start" and "stop" codes between each record and between each field of each record. The fixed-length field, header 

15 and the start/stop code methods aD resuH to excess transmitted toformatton. 

The record tof ormation table reding to both the host computer and remote device contatos storage format 
toformation for al the compacted records and the fieUs within these compacted records. Thus, ghren the number 
of the partition where a compacted record b stored withto the database image and the type byte appended to the 
compacted record, the storage fonnat for aD fields of the compacted record can be determmed from the record 

20 toformation table. Therefore, only the partitton number and type byte (two bytes totaD need be transmitted with 
the compacted record to order to completely describe to the transmitted toformation to the recehrtog device. 

If only a partial compacted record is transmitted (less than all of the f tolds), a field identification 'bit map" 
also needs to be transmitted with the compacted data vetoes, partition number and type byte to order to describe 
the toformation transmitted. This bit map b simply a series of bits, where the least significant bit represents the 

25 first field of the contacted record and the most significant bit represents the last fieU of the compacted record. 
A bit b set to "1" to todicate tiiat a field b transmitted and b set to "0" to todicate that a ftold b not transmitted. 
For example, if 8 fields are to a compacted record, the fieM identification bit map "00101 IOI2" todicates that the 
1st 3rd, 4tfi and 6th fnlds are transmitted. Therefore, thb bit map and the compacted record's partition number 
and type byte are aD that needs to be transmitted with the 4 compacted data vetoes to enable the recehrtog device 

30 to toterpret thb transmitted toformatton. 

If the database toiage contatos a combtoation field (two or more f»lds processed by the multipb-ftold 
combtoation compaction method), the field identification bit map and tiie correspondtog transmitted data b sightly 
more compScated. Thb compltoation arises because either an entfre combtoation field fte. aD »ngle-fieU constituents 
of a combtoation fiekO may be transmitted or fewer than aD the constituent fields may be transmittal The ftold 

35 identification bit map needs to dbttogobh these two possibiEttos. Abo, because a constituent ftold must be 
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transmitted 'n uncompacted form fit i$ only compacted when combhied with aB other constituent fields of a 
combination WW, an indication of the amount of uncompacted data transmitted must accompany the transmission. 

To apply the field identification bit map to both single Tields and combination fields, a correlation between 
specific bits of the bit map and constituent fields of a combination field needs to be estabBshed. Hiis correlation 
5 is best ilhistrated by example. Assume that a compacted record of a specific database image contains 8 fields, 
where the 4tb and 6th fields are combination fields, the 4th field consists of 3 constituent fields and the 6th field 
consists of 4 constituent fields. The bit map has 13 Uts, 6 bits for the single fields and 7 bits for the constituent 
fields of the 2 combination fields. The 3 least significant bits of the bit rap represents the first 3 (single) fields, 
the next 3 bits of the bit map represents the 3 constituent fields of the 4th (combmation) field, the next bit 

10 represents the Sth (sbigle) field, the next 4 bits represent the 4 constituent fields of the 6th (combmation) field, and 
the 2 most significant bits of the bit map represent the 7th and 8th (single) fieUs. Thus, the bit map 
"1001 lOOIOOIOOz' indicates the 3rd single-field, the 3rd constituent field of the 4th (combined) field, the 2nd and 
3rd field of the 6th (combmed) fieU and the Sth single-field are aB transmittel For transmitted constituent fieUs 
(those that are part of a combination field which is not transmitted in its entirety), a "length byte" is sent which 

IS precedes each constituent field's data vahie. This byte is equal to the number of bytes of uncompacted data 
transndtted (up to 255 bytes), if the bits for aD constituent fields of a combination field are set to "1," this 
indicates that the combination field is transmitted In its entirety an b m compacted form. 

As noted above, data communicated between a host computer and a remote device is, for the most part 
in compacted form. This data remains compacted for aD access operations. Data is returned to its original fonn 

20 only for presentation to its user {e.g., print or display). This compacted form reduces the communication channel 
traffic compared with the traffic if the original form of the data been transmitted. In addition, certain transactions 
can be performed in their entirety at the remote device* For example, a request for data containing a field associated 
with a translation table will return "no data found" if there is no entry corresponding to the suppEed data *m that 
translation table. 

25 In some cases, it is desirable to distribute portions of a database (remote partial database copies) to remote 

locations whOe mamtaming the complete database at a central location. These ronote partial database copies need 
not be the same at each remote location. Each remote partial database copy employs two unique sets of translation 
and auxiGary tables. The requirements of each remote location define its remote partial database copy and 
translation and auxiGary tables. Use of remote partial database copies allow for most of the processing to be 

30 performed at remote locations without communications with a central location. Each remote partial database copy 
appSes the same methods of compaction and access as the database at the central location. For example, the 
company of FIGS. 1A-E has several locations as shown in FIG. 1L FIG. IB shows the company's work order 
mformation as maintained at its central location. Each remote location could have a remote partial database copy 
of this W0RK_0RDERJNF0 table which consists of only those records which contam tiie WRK LOG vahtes 

35 associated with its bcation. 
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The database structure and method of buHding the same have been disclosed in detail In connection with 
the preferred embodiments. These embodiments are disclosed by way of examples only and are not to Emit the 
scope of the present nnrention, which is defined by the claims that follow. One of ordinary sklB in the art will 
appreciate many variations and modifications within the scope of this invention. 
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WHAT IS CLAIMED tS: 

1. A database storage method, which comprises: 
reading a plurafity of records from a database; 
setectbig a Mi within the records; 

5 reading a phirarity of data values, such that one of the data values is read from the field m each of the 

records; 

determming whether a f iaU characteristic dependent on the data values in the field, satisfies a compaction 

criterion; 

applying the compaction method only if the fmid characteristic satisfies the criterion, the compaction method 
10 creating a pluraSty of compacted data values, the compacted data values bemg reduced storage equhralents of the 
data values; and 

storing the compacted data vahies in a plurality of compacted records, such that each of the compacted 
records contain one of the compacted data values. 

2. A database storage rrathod according to claim 1, wherein the compaction method is single-field 
15 encocfing, comprising the steps of: 

ass^ning one of a plurafity of codes to each of the data values, the codes being binary numbers such that 
each of the data vahies can be determined from the codes; 

creatmg a phiraEty of compacted data vahes from the data vahies by replacing the data vahies with the 
codes assigned to the data vahies. 
20 3. A database storage method according to claim 2, wherein the ass'^ning step comprises the steps 

of: 

creating a sorted Gst of the data vahies by frequency of occurrence withm the field; 

equating one of a plurality of numeric equhrabnts to each of the data vahies, each of the numeric 
equhralents being a unquo, non-nqathre integer corresponding to the rank of one of the data vahies on the Est, such 
25 that the smallest one of the numeric equivalents corresponds to the most frequently occurring one of the data vahies 
and the largest one of the numeric equivalents corresponds to the least frequently occurring one of the data values; 
and 

encoding each of the numeric equivalents by replacing the numeric equivalents with the binary numbers, the 
bhary numbers being dependent on the numeric equhralents. 
30 4. A database storage method according to claim 3, wherein: 

the field characteristic is equal to the number of different data values in the field; and 
the criterion is the field characteristic must be less than a threshold, where the threshold is equal to a 
constant of proportionanty multiplied by the number of records m the database raised to a power of less than one. 
5. A database storage method according to claim 4, wherein the constant of proportionaOty is in the 
35 range of 10 to 100 and the power is m the range of .25 to .75. 
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6. A database storage method according to clam 5, wherein the constant of proportionaBty is 31.623 
and the power is 0.5. 

7. A database storage method according to cbim 1, wherein the compaction method is multip!e*field 
condiining, comprisnig the steps of: 

5 selecting a second field from the record; 

reading a second pluraity of data vahies from the second Field; 

identifyhig a phiraBty of data value combinationSr each of which comprise one of the data values and one 
of the second data vahies which occur together in one of the records; 

assigning one of a pluraSty of codes to each of the data vahie combinations, the codes being binary 
10 numbers, such that each of the data value combmations can be determined from the codes; and 

creating the compacted data vahies from the data value combinations by replacing the data vahte 
combbiations with the assigned codes. 

8. A database storage method according to claim 1, wherein the assignmg step comprises the steps 

of: 

15 creating a sorted Est of the data value combinations by frequency of occurrence within the first and the 

second rieMs, such that most frequently occurrmg of the data vahie combinattons b First on the Est and the least 

frequently occurring data vahie combination is last on the Est; 

equating one of a phiraEty of numeric equivalents to each of the data value combinations, each of the 

numeric equivalents being a unhiue, non-negative mteger corresponding to the rank of one of the data vahie 
20 combinatrans on the Est, such that the smaDest one of the numerfe equivalents corresponds to tiie most frequently 

occorrnig one of the data vahie combinations and the largest one of the numeric equivalents corresponds to the least 

frequently occurrmg one of the data vahies; end 

encoding each of the numeric equhralents by replacing the numeric equhralents witii the bmary numbers, the 

binary numbers bemg dependent on the numeric equhralents. 
25 g. A database storage method according to claim 8, wherem: 

the field characteristics are equal to an estimated mean of the number of records per stot and a estimated 

standard deviatfam of the number of records per slot, where the estimated mean is equal to a sample mean multipfied 

by a sample interval and tiie estimated standard deviation b equal to a sample standard devbtion muhipEed by the 

square root of the sampte hiterval; and 
30 the criteria are the estimated mean must be greater than one half of a threshold, the estimated mean phis 

the estimated standard deviation must be greater than the threshold and the estimated mean must be greater than 

tiie estbnated st»idard deviation, where the tiireshoM b a constant of proportionality multipSed by tiie number of 

records m tiie database rabed to a power of less than one. 

10. A database storage method according to daim 9, whereb the constant of proportbnaEty b in the 
35 range of 0X01 to 0.01 and tiie power b in the range of 0.5 to 0.99. 
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11. A database storage method according to clabn 10, wherein the constant of proportionaGty is 
0.005B and the power is a75. 

12. A database storage method according to claim 1, wherein the compaction method is pattern 
suppression, comprisnig the steps of: 

5 identifying a pattern, the pattern besig a character which occurs repeatedly at a specific character position 

of the data vabes; 

determtning those data values which contain the pattern; 

creating a phiraGty of compacted data vidueSr the compacted data values being the data values, with the 
pattern deleted from the data vataes which contain the pattern; and 
10 associating one of a plurafity of storage method designations to each of the cnnpacted data vahies, such 

that each of the des'gnations indicate whether the pattern was deleted from one of the compacted data values 
contahed in one of the compacted records. 

13. A database storage method according to claim 12, wherein: 

the field characteristics are a per character percentage and a pattern determination percentage; and 
IS the criteria are the per character percentage must be greater than a first threshold and the pattern 

determmation percentage must be greater than a second threshold, where the fvst and second thresholds are 
constants. 

14. A database storage method accordmg to claim 13, wheren: 
the pattern can be maximum of 4 characters; 

20 the first threshold is hi the range of 5% to 20%; and the second threshold is in tin range of 40% 

to 60%. 

15. A database storege method according to claim 14, wherem the first threshold is 10% and the 
second threshold is 50%. 

16. A database storage mediod according to claim 1, wherein the compaction method is numeric 
25 substitution comprising the steps of: 

determining if any of the data values are numeric; 
computing a binary equivalent of each of the data values that are numeric; and 
creating the compacted data values by replacmg the data values that are numeric with the binary equivalent 
of the data values. 

30 17. A database storage method according to claim 10, wherein: 

the field characteristic is the percentage of records which are numeric; and 

the criterion is the field characteristic must be greater than a threshold, where the threshold is a constant 

IB. A database storage method according to claim 17, wherein tiie threshold is m tiie range 75% to 

100%. 

35 19. A database storage method according to claim IB, wherein the tiireshold is 90%. 
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20. A database storage method according to claim 1, wherein the compaction method is text 
compression, comprising the steps of: 

identifymg a plurality of text portions of the data values, the text portions selected from the group 
consistmg of a delimiter, word or phrase; 
5 assigning a phiraRty of codes to the text portions, the codes being binary numbers such that the text 

portions can be determmed from the codes; and 

creating the compacted data vahies from the data values by replacmg each of the text portions within the 
data vahies with the assigned codes. 

21. A database storage method according to claim 20, wherein the assigning step comprises the steps 

10 of: 

22. A database storage method according to daim 20, wherein: 
the field characteristic is the number of different words; and 

the criterion is the field characteristic must be less than a threshold, where the threshold is equal to a 
constant of proportionality multiplied by the number of delimiters h the field raised to a power tess than one. 
15 23. A database storage method according to claim 22, wherem the constant of proportionality is m 

the range of 10 to 100 and the power is in the range of 0.25 to 0.75. 

24. A database storage method according to claim 23, wherein the constant of proportlonarity b 
63.246 and the power is 0.5. 

25. A database storage method according to claim i wherein the step of reading a plurality of records 
20 comprises the steps of: 

determining an average sample mtervat 

reading one of the records; 

skipping at bast one of the records; and 

repeating the steps of reading one of the records and skipping until aB of the records have been processed 
25 by one of the readmg and skqiping steps, such that an average of the records skipped is equal to the average sample 
interval 

26. A database storage method, which comprises: 
reaifing a phiraFity of records from a database; 

creating a phiraBty of compacted records by applying at least one of a pluraBty of compaction methods to 
30 each of the compacted records, each of the compacted records having a record length equal to the number of bits 
hi the record; 

creating a plurality of storage partitions, the storage partitions having a designated storage length equal 
to a specific number of bits; and 

stormg the compacted records in the storage partitions to create a database enage, such that each of the 
35 compacted records is stored in one and only one of the partitions, the record length of each of the compacted 
records in a specific one of the partitions being equal to the designated storage length of the partition. 
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27. A database storage method according to claim 26, wherein each of the record lengths are an 
Integral number of 8-bit bytes. 

28. A database storage method according to claim 26, wherein each of the record lengths are an 
integral number of 2 to 7-bit nibbles. 

5 29. A database storage method accordmg to claim 26, comprismg the additional step of: 

creating a ptarafity of subpartitions within each of the partitions, such that each of the subpartitions 
contains at least one of the compacted records and has a specific storage method designattonr the storage method 
designation mdicating the compaction method used to create the compacted records within the subpartitions. 

30. A database storage method according to claim 26, wherein: 

10 the storage method designation comprises a type byte appended to each of the compacted records, a 

partition number assodated whh each of the partitions and a record information table, such that the type byte and 
partition number specify an entry m the record information table and the entry indicates the compaction methods 
used to create each of the compacted records. 

31. ' A compaction method, which comprises: 
15 reading a record of a database; 

selecting a phiraGty of fields in the record; 

reading a plurality of data vahies from the record, one of the data vahtes bemg read from each of the 
selected fields to form a data value combination; 

assigning a code to the data vahie combination, such that the code uniquely determines the data vahie 
20 combination; 

replacing the data vatae combination with the assigned code to create a compacted data value, the 
compacted data value being a reihiced storage equivalent of the data value combination; and 

creating a translation table which contains an entry that equates the data vahie combination with the 
assigned code, to provide for retranslation from the code to the data vahie combinetion. 
25 31 A compaction method, which comprises: 

selecting a Field of a database; 

readmg a phiraGty of data values from the Mi; 

Uentifymg a single-poshion pattern, the single-position pattern being a character which occurs repeatedly 
withh the field at a specific character position of the data vahies; 
30 creating a plurality of compacted data vahies from the data vahies by deleting the smgb-position pattern 

from each of the data vahies within which the smgle-position pattern occurs; 

storing the compacted data vahies in a phiraFity of compacted records; and 
associatmg with each of the compacted records a storage method designation, the des'gnation indsating 
that the compacted record contains one of the compacted data vahies with the single-position pattern deleted. 
35 33. A compaction method, whUi comprises: 

selecting a field of a database; 
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reading a phiraKty of data values from the f«ld; 

identifying a multiple-position pattern, the multiple-position pattern bemg a pluraEty of characters and a 
phiraEty of character positions in one-tonme correspondence with the charact^s, such that aB of the characters 
occur together at the corresponding character positions within more than one of the data vahie$; 
5 creafmg a plurality of compacted data vahtes from the data vahies by deleting the muttiple-position pattern 

from each of the data values within which the multiple-position pattern occurs; 

storing the compacted data vahies 'm a plurality of compacted records; and 

associating with each of the compacted records a storage method designation, the designation mdicating 
that the compacted record contabis one of the compacted data values with the multiple-position pattern deleted. 
10 34. A database encoding method, which comprises: 

selecting a field of a database; 

identifying e phraBty of data items within the field selected from the group consisting of a data vahte, a 
data value combination, a word or a phrase; 

detenrantng a relathre frequency of occurrence with'm the field for each of the data items compared with 
15 all other of the data items; 

creating a sorted list of the data items based on the relathre frequency of occurrence; 

equating one of a plurafity of numeric equhralents to each of the data items, each of the numeric 
equhralents being a mrique, non-negative integer corresponding to the rank of one of the data items on the sorted 
fist, such that the smallest one of the numeric equivatents corresponds to the most frequently occurrsig one of the 
20 data items and the largest one of the numeric equivalents corresponds to the least frequently occurring one of the 
data items; and 

assigning a phirality of codes to the data items, the codes being h one-to-one correspondence with the data 
items and dependent on the corresponding numeric equhralents, each of the codes barring a storage length that is 
at least as ^all as aB others of the codes having a smaller relathre frequency of occurrence; and _ 
25 creating a phvaEty of compacted data vahies by replacing within the field the data items with the codes. 

35. A database encoding method according to claim 34, wherein each of the codes assigned comprises: 
a binary number equal to the corresponding numeric equhralent of each of the data vahies, using the 

smallest necessary integral-number of bytes. 

36. A database encoding method accordmg to daim 34, wherein each of the codes assigned comprbes: 
30 a 1-byte binary number equal to the corresponding numeric equhralent of each of the data values modulo 

256. 

37. A database encoding method accordmg to daim 34, wherein each of the codes assigned comprises: 
a 1-byte binary number equal to the corresponifing numeric equivalent of each of the data values having 

the smallest 256 numeric equhralents; 
35 a 1-byte binary number equal to the corresponding numeric equhralent of each of the data values havmg 

the next to the smallest 256 numeric equhralents minus 256; and 
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a binary number equal to the corresponding numeric equivalent of each of the data vahies having the largest 
512 numeric equivalents, using the smallest necessary integral-number of bytes. 
' 38. A database structure which comprises, 
a pluraBty of compacted data vahies; 
5 a plurafity of compacted records each having a record length and contaming at bast one of the compacted 

data vahies; 

a phirality of partitions conqirismg a database image and contain'mg the compacted records, the partitions 
each having a designated storage length, where the designated storage length of a specific one of the partitions is 
equal to the record length of each of the compacted records contamed in the partition. 

10 39. A database structure accordmg to cbtm 38, wherein each of the ptaraDty of partithms comprises 

a plurality of subpartitions, each of the subpartitions containing at hast one of the compacted records such that each 
of the compacted records m a specific one of the subpartitions contains compacted data values created by an 
identical compaction method. 

. 40. A database structure according to claim 38, which further comprises a translation table containmg 

15 a plurality of entries, each of which equates a specific one of the compacted data values to an equhralent 
uncompacted data value. 

41. A database structure according to claim 38, which further comprises a record mforntation table 
contaming a plurality of cohimns, each of whh:h specifies a storage method for each of the compacted data vahies 
contained in one of the compacted records, each of the columns associated with one of the partitions and each of 
20 the subpartitions with one of the partitions. 

4Z A database structure according to claim 38, further comprismg an index, for effic'iently searchmg 
the database image. 

43. A database structure according to claim 38, further comprising a plurality of indices for efffeiently 
searching the database image, such that a single index that results in the least number of records to be searched 

25 can be chosen from the plurality of indices. 

44. A database structure, which comprises: 

a removable storage medium capable of being loaded into a computer system mass storage deince; 

a database image stored on the removable storage media and containing a plurality of compacted records, 
each of the compacted records having a record length and comprising a plurality of compacted data vahies; 
30 a translation table stored on the removable storage media, comprising a plurality of entries, where each of 

the entries equates one of the compacted data values with an equivalent uncompacted data vahie, such that the 
compacted data vahie can be translated into an uncompacted data value, the uncompacted data value being in a user- 
readable format 

45. A database structure according to clabn 44, wherein the removable storage medium is a CD-ROM. 
35 46. A database structure according to claim 44, wherein the database image is divided into a plurality 

of partitions, each of the partitions having a designated storage length and contaming at least one of the compacted 
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recordSr where the record length of each of the compacted records in each of the partitions is equal to the 
designated length of the partitions. 

47. A database system^ which comprises: 

a computer system, comprising a random access memory^ a mass storage memory, an input device, an 
5 output device, a processor and a bus for couplmg tite random access memory, the mass storage memory, the input 
device, tiie output device and the processor; 

an operatmg systeni, executable by the processor, for controBmg die functions of the computer system; 

a database structure, comprising a database image residing in the mass storage memory and a translation 
table residing in the random access memory; 
10 the database image comprising a phirafity of compacted records, the compacted records comprismg a 

plurality of compacted data values; 

the translation table comprising a plurafity of entries equating the compacted data vahies virith a phirafity 
of equhrabnt uncompacted data values, such that the compacted data vahies can be translated mto tiie uncompacted 
data vahies; 

15 an access subsystem, executable by tiie processor, for performing a user request entered on the input 

device, which comprises reading a subset of the compacted records from the database image, reading tiie compacted 
data vabies from these compacted records, the compacted data vahie is translated mto the uncompacted data value 
and the uncompacted data vahie is written to the output device. 

48. A database system according to claim 47, which further comprises a build subsysteni, executable 
20 by the processor, for creating the database image and the translation table from a database, the database comprising 

a plurofity of records, each of the records containing the uncompacted data value. 

49. A database system according to claim 47, wherein the build subsystem mcludes a build 
preprocessor for mputting data mto the database, the database bemg the source of the uncompacted data value from 
which the database onage and the translation table are created. 

25 50. A database system according to claim 47, further comprising: 

a remote device, comprising a second random access memory, a non-volatile memory, a second input device, 
a second output device, a second processor and a second bus for coupfing the second random access memory, the 
non-votat3e memory, the second mput device, the second output device and the second processor, for accessing data 
bi the database system at a bcation separate from ttiat of the computer system; end 

30 a communications channel connecting the remote device to the computer system. 

51. A database system according to dain 47, wher^ a partial database nnage copy reshles in the 
non-volat3e memory. 

52. A database system according to claim 47, further comprising: 
a database management system; and 

35 an uncompacted database accessible by the database management systenv such that deletesr inserts and 

updates are performed both by the access subsystem on the database bnage and by the database management 
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system on the uncompacted database, a subset of database access requests are processed only by the access 
subsystem on the database bnage and database access requests which are not in the subset are processed only by 
the database management system on the uncompacted database. 

53. A database system according to claim 47, wherein the access subsystem further comprises: 
5 a statistical analysis subprocess for readmg the ^gfe^field translation tables to derive information about 

the distribution of data vahies within fields, and for reading the muhiple-fieid translation tables to derive information 
about the relationship between combined fields; and 

a reporting subprocess for creating a report based on the informat'ion derived by the statistical analysis 
subprocess and transmitting the report to the output device. 
10 54. A database access method, comprismg: 

providing a user access request to a computer system with a mass storage memory, within which a 
database image is stored and with a random access memory, within which a translation table and a record 
informafion table is stored, the database image comprismg a pluraDty of compacted records containnig a phiraity 
of data vahies; 

15 identifying a requested field and a r^uested compacted data value regarding which mformation from the 

database image is requested; 

searchmg the database bnage for each of the compacted records which contain the requested compacted 
data vahie m the requested fieM; and 

reaiEng each of the compacted records from the database image. 
20 55. A database access method according to claim 54, whereh the identifying step comprises: 

mterpreting the user access request with the computer system to determine the requested field and a 
requested data vahie within the request field; and 

reading a requested compacted data value from the translation table based on the requested data value. 
56. A database access method according to claim 54, wherein the providing step comprises: 
25 mputtmg a user request to a remote device, the remote device compriang a non-volatDe memory in whfch 

are stored a second translation table nlentteal to the translation table and a second record mformation table identical 
to the record mformation table, the remote device being in a location separate from the computer system and 
connected to the computer system with a communications channel; 

interpret'mg the user access request with the remote device to determine the requested field and a requested 
30 data vahie witiiin the requested field; 

determming a compaction method which was appfied to tfie requested field during the build of tiie database 

onage; 

readmg a requested compacted data vahie from the second translation table based on the requested data 
value; and 

35 transmitt'mg information regarding the requested field and the requested compacted data value to the 

computer system via the communications channel 
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57. A database access method accordmg to claim 58 wherein the transmitting step comprises: 
creating a storage format identifier from the second record mformation table based on the requested field 

and the compaction method appSed to the requested data value to create the requested compacted data value; 
creating a field Identifier based on the requested field; 
5 sending from the remote device to the computer system over the communications channel a message 

comprising the requested compacted data value, the storage format identifier and the field identifier. 

58. A database access method according to daim 57, wherein the storage format identifier is a type 
byte and a partition number, such that the type byte describes a subpartition which contains the requested 
compacted data value and the partition number describes a partitkin of the database image which contains tiie 

10 subpartition. 

59. A database access method according to claim 57, wherein the field identifier is a field 
Mentification bit map comprising a phiraRty of bits such that each of the bits represent one of a plurality of fields 
of tiie database image and one of the bits corresponding to the requested field is set to mdicate that the requested 
field is transmitted. 

15 60. A database access method according to claim 57, wherein the identifying step comprises: 

recehring the message from the remote device over the communications channel to the computer system; 

and 

determming from the message the requested field and the requested compacted data vahie. 
61. A database access method according to claim 54, further comprismg: 
20 preparing a report which specifies the frequency of occurrence of tiie data values within the field, based 

on tiie translation table; and 

outputting tiie report to the user. 
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